# 07 Naive Bayes
A ranking classifier is a classifier that can order a test set by its confidence in a given classification outcome.  
Naive Bayes is a ranking classifier because its predicted class ‘probability’ can serve as that confidence measure.
1. Train a Naive Bayes classifier from the `AthleteSelection` data. Use `GaussianNB`.
2. Load the test data from `AthleteTest.csv` and apply the classifier. 
3. Use the `predict_proba` method to find the probability of being selected. 
4. Rank the test set by probability of being selected.  
    4.1. Who is most likely to be selected?  
    4.2. Who is least likely?  
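
Before turning to the athlete data, the ranking idea can be sketched on a toy problem. Everything in this block is an assumption made for illustration: the training data is randomly generated (it is not the athlete data), and the names `X_train`, `X_new`, `pos_col` and `p_pos` are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up training data (NOT the athlete data): two well-separated classes
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(2.0, 0.5, size=(20, 2)),
                     rng.normal(6.0, 0.5, size=(20, 2))])
y_train = np.array([0] * 20 + [1] * 20)

model = GaussianNB().fit(X_train, y_train)

# Three hypothetical test points: near class 0, in between, near class 1
X_new = np.array([[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]])

# predict_proba returns one column per class, ordered as in model.classes_
pos_col = list(model.classes_).index(1)
p_pos = model.predict_proba(X_new)[:, pos_col]

# Row indices of X_new, most confident for class 1 first
ranking = np.argsort(-p_pos)
print(ranking)
```

Looking up the column through `model.classes_` rather than hard-coding index 1 keeps the sketch correct even if the class labels are not 0 and 1.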


In [4]:
import pandas as pd
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.metrics import confusion_matrix 

In [5]:
athlete = pd.read_csv('AthleteSelection.csv',index_col = 'Athlete')
athlete.head()

Athlete,Speed,Agility,Selected
x1,2.5,6.0,0
x2,3.75,8.0,0
x3,2.25,5.5,0
x4,3.25,8.25,0
x5,2.75,7.5,0


In [6]:
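y = athlete.pop('Selected').values  # pop removes the target column from the data frame
X = athlete.values                  # the remaining columns (Speed, Agility) are the features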

In [7]:
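gnb = GaussianNB()      # the features are continuous, so the Gaussian variant is used
bnb = BernoulliNB()     # not used in this exercise
mnb = MultinomialNB()   # not used in this exercise
ath_NB = gnb.fit(X,y)
y_dash = ath_NB.predict(X)  # predictions on the training data itself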

In [8]:
confusion = confusion_matrix(y, y_dash)
print("Confusion matrix:\n{}".format(confusion)) 

Confusion matrix:
[[12  0]
 [ 1  7]]


In [9]:
print(y)
print(y_dash)

[0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1]
[0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 1 1 1]
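
In scikit-learn's `confusion_matrix`, rows are true labels and columns are predicted labels, so the single off-diagonal count is a selected athlete predicted as not selected. As a small sketch, the matrix can be rebuilt by hand and the misclassified position located using only the two arrays printed above. The arrays are copied from that output; `misclassified` is a name introduced here for illustration.

```python
import numpy as np

# Label arrays copied from the printed output above
y = np.array([0] * 12 + [1] * 8)
y_dash = np.array([0] * 12 + [1, 1, 0, 1, 1, 1, 1, 1])

# Rebuild the 2x2 confusion matrix by counting (true, predicted) pairs
confusion = np.zeros((2, 2), dtype=int)
for true_label, predicted_label in zip(y, y_dash):
    confusion[true_label, predicted_label] += 1
print(confusion)

# Zero-based positions where the prediction disagrees with the true label
misclassified = np.flatnonzero(y != y_dash)
print(misclassified)
```

The position is an index into the training arrays, so it can be used with `athlete.index` to look up the athlete's label.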


## Test Data 

In [10]:
ath_test = pd.read_csv('AthleteTest.csv',index_col = 'Athlete')
ath_test

Athlete,Speed,Agility
t1,3.3,8.2
t2,4.5,4.5
t3,5.5,7.2
t4,3.8,8.8
t5,5.5,5.2
t6,8.1,7.8
t7,7.7,5.2
t8,6.1,5.5
t9,5.5,6.0
t10,6.1,5.5


We can apply the classifier to the feature values of the data frame. 

In [11]:
X_test = ath_test.values
y_test = ath_NB.predict(X_test)
y_test

array([0, 0, 1, 0, 1, 1, 1, 1, 1, 1], dtype=int64)

In [12]:
ath_test

Athlete,Speed,Agility
t1,3.3,8.2
t2,4.5,4.5
t3,5.5,7.2
t4,3.8,8.8
t5,5.5,5.2
t6,8.1,7.8
t7,7.7,5.2
t8,6.1,5.5
t9,5.5,6.0
t10,6.1,5.5


1. Use `predict_proba` to get probabilities of being selected.  
2. Store these probabilities as a new column in the data frame.
3. Sort the data frame by this column. 

In [13]:
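y_probs = ath_NB.predict_proba(X_test)  # one column per class, ordered as in ath_NB.classes_
ath_test['Prob']=y_probs[:,1]   # column 1 is the probability of class 1 (selected)
ath_test.sort_values(by=['Prob'], ascending=False, inplace = True)  # highest probability first
ath_test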

Athlete,Speed,Agility,Prob
t6,8.1,7.8,0.999997
t7,7.7,5.2,0.999945
t8,6.1,5.5,0.972931
t10,6.1,5.5,0.972931
t3,5.5,7.2,0.911933
t9,5.5,6.0,0.854283
t5,5.5,5.2,0.799833
t4,3.8,8.8,0.150478
t2,4.5,4.5,0.122983
t1,3.3,8.2,0.041314
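
The two ranking questions can be read off the sorted table, or answered programmatically. The sketch below assumes only the probabilities shown in the table above, copied into a hypothetical `probs` series in the original file order. It also checks that thresholding the probability at 0.5 gives the same 0/1 labels that `predict` returned earlier.

```python
import pandas as pd

# Probabilities copied from the ranked table above, restored to file order t1..t10
probs = pd.Series(
    [0.041314, 0.122983, 0.911933, 0.150478, 0.799833,
     0.999997, 0.999945, 0.972931, 0.854283, 0.972931],
    index=[f"t{i}" for i in range(1, 11)],
    name="Prob",
)

most_likely = probs.idxmax()   # athlete with the highest probability of selection
least_likely = probs.idxmin()  # athlete with the lowest probability of selection
print(most_likely, least_likely)

# Non-mutating alternative to sort_values(..., inplace=True)
ranked = probs.sort_values(ascending=False)

# With two classes, predicting the more probable class is a 0.5 threshold
hard = (probs > 0.5).astype(int).tolist()
print(hard)
```

Note that t8 and t10 have identical features and therefore identical probabilities, so their relative order in the ranking is arbitrary.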
