# Naive Bayes : Gaussian

The Naive Bayes model is based on Bayes' theorem, which describes how the probability of an event should be updated given observations of that event in previously recorded data. It is called "naive" because it assumes the features are independent of one another given the class.

### $$ P(A|B) = \frac{P(B|A)\;P(A)}{P(B)} $$
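
As a small worked example of the formula, the sketch below plugs in made-up numbers (they are assumptions for illustration, not values from this dataset): a prior P(A) of 4% and two hypothetical likelihoods.

```python
# Worked example of Bayes' theorem with hypothetical numbers.
p_a = 0.04              # P(A): prior probability of the event (assumed)
p_b_given_a = 0.90      # P(B|A): chance of the observation when A is true (assumed)
p_b_given_not_a = 0.20  # P(B|not A): chance of the observation when A is false (assumed)

# P(B) by the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 4))  # 0.1579
```

Even with a strong likelihood, the low prior keeps the posterior modest, which is worth keeping in mind for rare classes.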

### Probabilistic Classifier

#### Three Naive Bayes Models
- Bernoulli
    - Used mostly when features contain binary data, 0s & 1s
- Multinomial 
    - Most frequently used when features of data are whole integers
- Gaussian
    - Used when data is continuous and each feature is assumed to be normally distributed within each class
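
To make the Gaussian variant concrete, here is a minimal from-scratch sketch of how it scores a row: a class prior times one normal density per feature, computed in log space. The toy data and the helper name `gaussian_log_likelihood` are made up for illustration; this is not how the rest of the notebook fits its model.

```python
import numpy as np

def gaussian_log_likelihood(x, mean, var):
    """Log of the normal density, summed over features (the 'naive' independence assumption)."""
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var), axis=-1)

# Toy data (made up): two continuous features, two classes.
X_toy = np.array([[1.0, 2.1], [0.8, 1.9], [1.2, 2.0],
                  [3.0, 4.2], [3.2, 3.9], [2.9, 4.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

classes = np.unique(y_toy)
log_posteriors = []
for c in classes:
    X_c = X_toy[y_toy == c]
    prior = len(X_c) / len(X_toy)                   # P(class), estimated from class frequency
    mean, var = X_c.mean(axis=0), X_c.var(axis=0)   # per-feature Gaussian parameters
    log_posteriors.append(np.log(prior) + gaussian_log_likelihood(X_toy, mean, var))

# Pick the class with the highest (unnormalized) log posterior for each row.
toy_pred = classes[np.argmax(np.column_stack(log_posteriors), axis=1)]
print(toy_pred)
```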

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler 
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

from sklearn.naive_bayes import GaussianNB , MultinomialNB

from sklearn.metrics import confusion_matrix



In [8]:
X = pd.read_csv('../00_Data/X.csv')
y = pd.read_csv('../00_Data/y.csv')

In [10]:
df = pd.read_csv('../00_Data/df_complete.csv')
df.fillna(value = 0 , inplace  = True)

#### Data visualization

In [9]:
print(X.shape , y.shape)

(85267, 643) (85267, 1)


## Train test split

- This is done so that we can later check that our model predicts accurately on unseen data.
 - More training data generally helps, so in this case I am holding out only 30% of my data for testing.
 - Shuffle = True : train_test_split shuffles by default, so rows are randomly assigned to either the train set or the test set. This keeps the split from depending on the order of the rows in the file, so my model won't be fit to one particular stretch of the data.
 
##### Unbalanced Data:   
   - Negative Class = 96%
   - Positive Class = 4%

##### Stratify = y :
   - Since my classes are unbalanced & rows are randomly assigned to the train or test set, it is possible that most of my positive class could end up in the test set, & my model would do horribly because it would have very little positive-class data to train on. Stratifying on y makes sure the positive class makes up the same proportion of rows in both my training and testing sets.

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=42,
                                                    train_size=.7,
                                                    stratify=y)
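
A quick way to check what stratify does is to compare the share of positives in each split. The sketch below uses synthetic stand-in data with a 4% positive class (the names `X_demo`, `y_demo`, and the `Xd_`/`yd_` splits are hypothetical, not the notebook's variables).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the real dataset): 1,000 rows, 4% positive.
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([1] * 40 + [0] * 960)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=42, stratify=y_demo)

# With stratify, both splits keep roughly the same share of positives.
print(yd_train.mean(), yd_test.mean())
```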



### Pipeline:
Pipelines are a way to run multiple processing steps in the order in which they are listed. The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. In the model below, the steps are Standard Scaler followed by Gaussian Naive Bayes.

In [12]:
pipe = Pipeline([
       ('ss',StandardScaler()),
       ('GNB',GaussianNB())
])
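
As a rough illustration of what the pipeline does under the hood, the sketch below fits the same two steps by hand on synthetic data and compares the predictions. The data and the names `demo_pipe`, `scaler`, and `manual_nb` are assumptions made for this example.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the real dataset).
rng = np.random.RandomState(1)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 4))
y_demo = (X_demo[:, 0] > 5.0).astype(int)

demo_pipe = Pipeline([('ss', StandardScaler()), ('GNB', GaussianNB())])
demo_pipe.fit(X_demo, y_demo)

# Equivalent manual version: scale first, then fit the classifier on the scaled data.
scaler = StandardScaler().fit(X_demo)
manual_nb = GaussianNB().fit(scaler.transform(X_demo), y_demo)

same = np.array_equal(demo_pipe.predict(X_demo),
                      manual_nb.predict(scaler.transform(X_demo)))
print(same)
```

The practical benefit of the pipeline form is that, inside cross-validation, the scaler is refit on each training fold rather than on all of the data.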

### Hyper Parameters:
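 These are the settings that we choose before training, rather than values the model learns. Feeding them through a grid search trains a model on every combination of the listed values and outputs the best model.
> - In the scikit-learn version this notebook was written against, Gaussian Naive Bayes has a single hyperparameter, 'priors': the prior probability of each class. Unfortunately I have no priors to supply, so I set the parameter to None, which lets the model estimate them from the class frequencies in the training data. (Newer scikit-learn releases also expose 'var_smoothing'.)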

In [13]:
params= {
    'GNB__priors': [None],
}
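
To see what priors = None means in practice, the sketch below fits GaussianNB on synthetic data with a 96% / 4% class balance and inspects the fitted `class_prior_` attribute. The data and the explicit `[0.5, 0.5]` priors are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (not the real dataset): 96% negative, 4% positive.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(500, 3))
y_demo = np.array([0] * 480 + [1] * 20)

# priors=None (the default): class priors are estimated from the training labels.
nb_default = GaussianNB(priors=None).fit(X_demo, y_demo)
print(nb_default.class_prior_)   # class frequencies in y_demo

# Explicit priors override the observed frequencies (values here are arbitrary).
nb_fixed = GaussianNB(priors=[0.5, 0.5]).fit(X_demo, y_demo)
print(nb_fixed.class_prior_)
```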

In [14]:
y_ravel = y_train.values.ravel()

### Grid Search:
- Grid search performs hyperparameter tuning, which is the process of selecting the parameter values that maximize the model's score. It does this by fitting a model for every combination of the supplied parameters, scoring each one with cross-validation, and keeping the combination with the best score.
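
The number of candidate models a grid search fits can be checked with scikit-learn's `ParameterGrid`. In the sketch below, `params_demo` mirrors the single-value grid used in this notebook, while `bigger_grid` is a hypothetical grid with arbitrary 'var_smoothing' values (a parameter that only exists in newer scikit-learn releases).

```python
from sklearn.model_selection import ParameterGrid

# The grid used in this notebook: a single value, so only one candidate model.
params_demo = {'GNB__priors': [None]}
print(len(ParameterGrid(params_demo)))

# A hypothetical larger grid (var_smoothing values chosen arbitrarily for illustration).
bigger_grid = {'GNB__priors': [None], 'GNB__var_smoothing': [1e-9, 1e-7, 1e-5]}
for candidate in ParameterGrid(bigger_grid):
    print(candidate)
```

With only one candidate, the grid search here amounts to a cross-validated fit of a single model.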


#### Scoring:
- I am scoring with 'roc_auc', which stands for "Receiver Operating Characteristic, Area Under the Curve". The reason I am scoring with this instead of accuracy is how unbalanced my classes are. We could predict zero for every data point and still get a 96% accuracy score, because 96% of the data is in our negative class. ROC AUC instead measures how well the model ranks positives above negatives, by comparing the true positive rate with the false positive rate across all thresholds. A score of 0.5 is no better than chance.
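
The accuracy trap described here can be reproduced in a few lines. The sketch below uses synthetic labels with a 96% / 4% split and a stand-in "model" that always predicts the negative class; the variable names are made up for the example.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic labels with a 96% / 4% class balance.
y_true = np.array([0] * 96 + [1] * 4)

# A "model" that always predicts the negative class.
always_zero = np.zeros_like(y_true)

acc = accuracy_score(y_true, always_zero)
auc = roc_auc_score(y_true, always_zero.astype(float))
print(acc, auc)  # 0.96 0.5
```

Accuracy rewards the useless predictor, while ROC AUC scores it as chance level.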

In [15]:
GNB_gs = GridSearchCV(pipe, param_grid= params, scoring = 'roc_auc', cv = 3)
GNB_gs.fit(X_train, y_ravel);

In [16]:
GNB_gs.score(X_train , y_ravel ) , GNB_gs.score(X_test , y_test )

(0.5812016261556112, 0.5725153379614374)

### Predictions

In [17]:
predictions = GNB_gs.predict(X_test)

### Probability of  Predictions

In [18]:
predict_prob = GNB_gs.predict_proba(X_test)
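
For reference, the sketch below shows how predict and predict_proba relate on a tiny made-up dataset: the predicted label is the class with the highest probability, and a custom cutoff on the positive-class column can be applied instead. The 0.9 cutoff and all names here are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data (made up) with two well-separated classes.
X_demo = np.array([[0.0], [0.2], [0.4], [2.0], [2.2], [2.4]])
y_demo = np.array([0, 0, 0, 1, 1, 1])
clf = GaussianNB().fit(X_demo, y_demo)

X_new = np.array([[0.1], [1.1], [2.3]])
proba = clf.predict_proba(X_new)   # one column per class, ordered as clf.classes_
labels = clf.predict(X_new)

# predict returns the class whose probability is highest.
from_proba = clf.classes_[np.argmax(proba, axis=1)]

# A custom cutoff on the positive-class column trades recall against precision.
strict_labels = (proba[:, 1] >= 0.9).astype(int)
print(labels, from_proba, strict_labels)
```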

### Confusion Matrix
   
##### True Positive:
   - A True Positive is a case where my model predicts the positive class & is correct in that prediction
   
##### True Negative:
  - A True Negative is a case where my model predicts the negative class & is correct in that prediction

*We want to optimize for these predictions, because getting both right means my model can really tell the difference between the classes



In [20]:
confusion_matrix(y_test , predictions)

array([[ 4222, 20415],
       [   25,   919]])
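
The matrix above can be turned into recall, precision, and specificity directly. The sketch below hard-codes the reported counts and assumes scikit-learn's layout of [[TN, FP], [FN, TP]] with 0 as the negative class.

```python
import numpy as np

# Confusion matrix reported above; sklearn's layout is [[TN, FP], [FN, TP]].
cm = np.array([[4222, 20415],
               [25, 919]])
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)        # share of actual positives that were found
precision = tp / (tp + fp)     # share of positive predictions that were right
specificity = tn / (tn + fp)   # share of actual negatives that were found

print(round(recall, 3), round(precision, 3), round(specificity, 3))  # 0.974 0.043 0.171
```

Recall is high only because the model labels most rows as positive, which is why precision and specificity are so low.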

### Data Frame for my Predictions compared with True Classes

In [21]:
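# predictions and predict_prob follow the row order of X_test, so the true labels
# are taken from y_test with its index reset to keep the rows aligned
# (assumes y holds the award_binary labels).
pred_df = pd.concat([pd.DataFrame(predictions, columns=['Predictions']),
                     y_test.reset_index(drop=True),
                     pd.DataFrame(predict_prob, columns=['Probability_no_award', 'Probability_award'])],
                    axis=1)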

## Summary

I expected a Gaussian Naive Bayes model to be a good fit for my data. It performed better than logistic regression on recall, but precision was atrocious. It correctly predicted 3116 songs that won awards, but also incorrectly assigned about 68,000 other songs to my positive class. (On the test set alone, the confusion matrix above shows 919 true positives against 20,415 false positives.) I scored with 'roc_auc', on which this model reached about 0.58 on train & 0.57 on test, only slightly better than the 0.5 of random guessing. One possible explanation is that the features do not follow the normal distribution that Gaussian Naive Bayes assumes.