# Pair Problem: Wookiee neighbors

Your rebel alliance team has been stranded on a remote planet in the Outer Rim. The memory banks of your ship have been wiped. You are the only surviving data scientist on the team.

Your location is near several planets that are largely inhabited by wookiees. Unfortunately, the different tribes of wookiees have different attitudes toward the alliance. It's important that your team know which tribe (represented by the color of the wookiee) will be on a planet or ship before exiting warp nearby. If you end up near a hostile wookiee tribe, you may not have time to reactive your warp drive before things turn sour.

The problem is that your databank is fried. Out of millions of ships and planets, you only know the location and color of a few hundred wookiee tribes.

Your team turns to you, the only one on the team that is capable of harnessing the power of The Force (data science). However, all of your tools were destroyed in the memory bank failure: no neural networks, no random forests, no models of any kind.

# Details

You must code, from scratch, a working KNN classification algorithm. Use the train-test split below to evaluate your model and then generate predictions for each of the observed wookiee ships in the holdout set. If you get some time, compare your results to those of the `sklearn.cluster.KMeans` classifier. What classification metric is most important to us here?

- [train data](http://soph.info/metis/nyc18_ds15/wookiee-train.csv)
- [test data](http://soph.info/metis/nyc18_ds15/wookiee-test.csv)
- [holdout data](http://soph.info/metis/nyc18_ds15/wookiee-ho.csv)


# Possible extensions:

 * Does your solution work for any number of features in the training data sets?
 * Does your solution handle ties (equidistance)?
 * Can you add another parameter, `k`, to your solution so that it uses the `k` nearest Wookiees instead of only the nearest Wookiee?
 * Can you add to your solution so that it has reasonable behavior if `y_train` is numeric?

An extension of another kind:

 * Are you confident that your solution is correct? How can you ensure that it is, and check that it stays correct in the future?

In [7]:
import pandas as pd 
import numpy as np 

In [5]:
wookie_train = pd.read_csv("wookiee-train.csv")
wookie_ho = pd.read_csv("wookiee-ho.csv")
wookie_test = pd.read_csv("wookiee-test.csv")

In [6]:
wookie_train.head()

Unnamed: 0.1,Unnamed: 0,wookieecolor,xcoord,ycoord,zcoord
0,0,red,-3.410692,0.8544,0.228154
1,1,red,0.35008,-0.75112,-1.845183
2,2,chartreuse,0.841712,-0.058204,0.246217
3,3,red,-0.64626,-1.821082,0.444616
4,4,blue,1.423538,2.269409,-1.061053


In [47]:
wookie_train.wookieecolor.value_counts()

red           292
white         222
blue          151
chartreuse     85
Name: wookieecolor, dtype: int64

In [25]:
wookie_train.shape

(750, 5)

In [20]:
def knn(x,y,z,dataframe):
    """
    
    """
    for wookie in range(0,len(dataframe)):
        distances = np.sqrt((x - wookie.xcoord)**2 + (y - wookie.ycoord)**2 + (z - wookie.zcoord)**2)
        min_distance = min(distances)
    
    
    
    return min_distance

In [38]:
distances_list = []
for wookie in wookie_train.itertuples():
    
    distance = np.sqrt((1 - wookie.xcoord)**2 + (1 - wookie.ycoord)**2 + (1 - wookie.zcoord)**2)
    distances_list.append(distance)


min_distance = min(distances_list)

print(min(distances_list))

nearest_wookie_index = distances_list.index(min_distance)

print(wookie_train.iloc[nearest_wookie_index].wookieecolor)


0.22946232580389325
blue


In [48]:
from collections import defaultdict

def knn_wookie(x,y,z,k):
    
    color_dict=defaultdict(int)
    for wookie in wookie_train.itertuples():
    
        distance = np.sqrt((x - wookie.xcoord)**2 + (y - wookie.ycoord)**2 + (z - wookie.zcoord)**2)
        distances_list.append(distance)

    sorted_distance = sorted(distances_list)
    closest_distances = sorted_distance[:k]
    #min_distance = min(distances_list)
    #print(min(distances_list))
    
    
    nearest_wookie_indexes = [distances_list.index(i) for i in closest_distances ]
    
    wookie_color_list = [wookie_train.iloc[i].wookieecolor for i in nearest_wookie_indexes]
    
    for color in wookie_color_list:
        color_dict[color]+=1
    
    
        
    print(wookie_color_list)
    
    print(color_dict)
    
    print(max(color_dict,key=color_dict.get))

In [50]:
knn_wookie(1,1,1,10)

['blue', 'blue', 'blue', 'blue', 'blue', 'chartreuse', 'chartreuse', 'chartreuse', 'chartreuse', 'chartreuse']
defaultdict(<class 'int'>, {'blue': 5, 'chartreuse': 5})
blue
