# Module 29 Topic Review - Naive Bayes Classification

## Gaussian Naive Bayes Function
"It allows you to develop a belief network, taking into account all of the available information regarding the scenario."

The Bayesian formula for the probability of a class $y$ conditioned on *multiple* features, assumed to be conditionally independent given $y$, is defined as follows:

$$ \Large P(y|x_1, x_2, \dots, x_n) = \frac{P(y)\prod_{i=1}^{n}P(x_i|y)}{P(x_1, x_2, \dots, x_n)} $$
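
As a minimal sketch of how the formula is applied, the example below uses made-up priors and likelihoods for two classes and two features (every number is hypothetical). It assumes the denominator is obtained by summing the numerators over all classes.

```python
# Hypothetical priors P(y) and per-feature likelihoods P(x_i | y) for one observation
priors = {0: 0.6, 1: 0.4}
likelihoods = {
    0: [0.1, 0.3],  # P(x_1 | y=0), P(x_2 | y=0)
    1: [0.5, 0.2],  # P(x_1 | y=1), P(x_2 | y=1)
}

# Numerator of the formula: P(y) * prod_i P(x_i | y)
numerators = {}
for label, prior in priors.items():
    joint = prior
    for p in likelihoods[label]:
        joint *= p
    numerators[label] = joint

# Denominator: P(x_1, ..., x_n), the sum of the numerators over all classes
evidence = sum(numerators.values())
posteriors = {label: num / evidence for label, num in numerators.items()}
prediction = max(posteriors, key=posteriors.get)
print(posteriors, prediction)
```

Because the denominator is the same for every class, it can be skipped when only the predicted class is needed.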

### 1: Preprocess data (clean data, split train-test samples) 
```Python 
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('filepath.csv')

y = df['target']
X = df.drop('target',axis=1)

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25)
```

### 2: Calculate $\mu$ and $\sigma$ for each feature within each target class

```Python
train_df = pd.concat([X_train,y_train],axis=1)
train_agg_df = train_df.groupby('target').agg(['mean','std'])
train_agg_df
```  
<img src='images/agg_table.png'>  
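
To make the aggregation concrete, here is a small standard-library sketch of the same per-class mean and standard deviation on a hypothetical `age` column (the rows are invented for illustration). It relies on the fact that `statistics.stdev` is the sample standard deviation, which is also the default for the pandas `'std'` aggregation.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical toy training rows: (age, target)
rows = [(58, 1), (64, 1), (70, 1), (41, 0), (45, 0), (37, 0)]

# Group the feature values by class label
ages_by_class = defaultdict(list)
for age, target in rows:
    ages_by_class[target].append(age)

# stdev() divides by n - 1, matching the default of pandas' 'std' aggregation
params = {c: (mean(values), stdev(values)) for c, values in ages_by_class.items()}
print(params)
```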




### 3: Calculate conditional probability point estimates for each feature per target class
Here $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the $i$th feature within class $y$:

$$ \large \text{point estimate for the } i\text{th feature} = P(x_i|y) = \frac{1}{\sqrt{2 \pi \sigma_i^2}}e^{-\frac{(x_i-\mu_i)^2}{2\sigma_i^2}}$$ 

```Python
from scipy import stats

def p_x_given_class(obs_row: pd.Series, feature: str, class_: int) -> float:
    """Gaussian density of the observation's feature value under the given class."""
    # Class-conditional mean and standard deviation from the step 2 table
    mu = train_agg_df[feature]['mean'][class_]
    sig = train_agg_df[feature]['std'][class_]

    # Evaluate the normal pdf at the observed feature value
    obs = obs_row[feature]
    return stats.norm.pdf(obs, loc=mu, scale=sig)

point_est = p_x_given_class(X_train.iloc[0], 'age', 0)
point_est

0.035036938123834606
```
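
The density formula from this step can be checked by hand. The sketch below implements it with the `math` module and compares it against `statistics.NormalDist` from the standard library; the values of `mu`, `sigma`, and `x` are hypothetical.

```python
import math
from statistics import NormalDist

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Hypothetical class-conditional parameters for an 'age' feature
mu, sigma = 56.0, 8.0
x = 63.0

by_hand = gaussian_pdf(x, mu, sigma)
reference = NormalDist(mu, sigma).pdf(x)
print(by_hand, reference)
```

Note that the result is a density rather than a probability, so for a small enough $\sigma$ it can exceed 1. That is fine here because only the relative sizes across classes matter.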

### 4: Predict class for a given observation
- For each possible class, multiply the point estimates for the observation together, then multiply the result by the prior probability of that class. Predict the class with the greatest resulting value.

```Python
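import numpy as np

def predict_class(obs_row):
    c_probs = []
    for c in range(2):  # binary target with classes 0 and 1
        # Initialize with the prior: relative frequency of class c in the training set
        p = len(y_train[y_train == c])/len(y_train)
        # Multiply by the point estimate of every feature (naive independence assumption)
        for feature in X.columns:
            p *= p_x_given_class(obs_row, feature, c)
        c_probs.append(p)
    # The index of the largest score is the predicted class
    return np.argmax(c_probs)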

predict_class(X_train.iloc[0])

0
```
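
With many features, a product of small densities can underflow to zero. A common workaround, sketched here as an alternative rather than as part of the walkthrough, is to sum log densities instead. The priors, the feature names, and the `(mu, sigma)` pairs are all hypothetical.

```python
import math

# Hypothetical fitted values: class priors and (mu, sigma) per feature per class
priors = {0: 0.45, 1: 0.55}
params = {
    0: {"age": (52.0, 9.0), "chol": (240.0, 50.0)},
    1: {"age": (56.0, 8.0), "chol": (250.0, 50.0)},
}

def log_gaussian(x, mu, sigma):
    """Log of the normal density, computed without exponentiating."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)

def predict_class_log(obs):
    """Return the class with the largest log prior plus sum of log densities."""
    scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for feature, (mu, sigma) in params[c].items():
            score += log_gaussian(obs[feature], mu, sigma)
        scores[c] = score
    return max(scores, key=scores.get)

print(predict_class_log({"age": 70.0, "chol": 260.0}))
```

Since the logarithm is monotonic, the class that maximizes the log score is the same class that maximizes the product.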

### 5: Evaluate the model using some chosen evaluation metric.

```Python
y_hat_train = [predict_class(X_train.iloc[idx]) for idx in range(len(X_train))]
y_hat_test = [predict_class(X_test.iloc[idx]) for idx in range(len(X_test))]

residuals_train = y_hat_train == y_train
acc_train = residuals_train.sum()/len(residuals_train)

residuals_test = y_hat_test == y_test
acc_test = residuals_test.sum()/len(residuals_test)
print('Training Accuracy: {}\tTesting Accuracy: {}'.format(acc_train, acc_test))

Training Accuracy: 0.8502202643171806	Testing Accuracy: 0.8289473684210527
```
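
An accuracy figure is easier to interpret next to a baseline. The sketch below, using invented labels and predictions, compares accuracy against always predicting the most common class; the helper name `accuracy` is hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Hypothetical labels and predictions
y_true = [0, 1, 1, 0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

# Baseline: always predict the most common class in y_true
majority_class = Counter(y_true).most_common(1)[0][0]
baseline = accuracy(y_true, [majority_class] * len(y_true))

print(accuracy(y_true, y_pred), baseline)
```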

## Document Classification with Naive Bayes

### 1: Preprocess data
Load, clean, and train-test split the data as above.

```Python
# Load data
df = pd.read_csv('SMSSpamCollection', sep='\t', names=['label', 'text'])
df.head()
```
<img src='images/word_table.png' height = 176>

```Python
# Class priors from the full dataset (note the imbalance between ham and spam)
p_classes = dict(df.label.value_counts(normalize=True))
p_classes

{'ham': 0.8659368269921034, 'spam': 0.13406317300789664}

# Deal with class imbalance: undersample the majority class to the minority class size
minority = df[df['label']=='spam']
majority_undersampled = df[df['label']=='ham'].sample(n=len(minority))

undersampled_df = pd.concat([minority,majority_undersampled])

undersampled_df['label'].value_counts()

ham     747
spam    747
Name: label, dtype: int64

# train-test split sampling
X = undersampled_df['text']
y = undersampled_df['label']
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=17)
train_df = pd.concat([X_train,y_train],axis=1)
test_df = pd.concat([X_test,y_test],axis=1)
```

### 2: Discover word frequency for each target class

```Python
class_word_freq = {} 
classes = train_df['label'].unique()

for class_ in classes:
    temp_df = train_df[train_df.label == class_]
    bag = {}

    for row in temp_df.index:
        doc = temp_df['text'][row]

        for word in doc.split():
            bag[word] = bag.get(word, 0) + 1
            
    class_word_freq[class_] = bag

class_word_freq

{'ham': {'Its': 7,
  'ok,': 1,
  'if': 26,
  'anybody': 1,
  'asks': 1,
  'abt': 4,
  'me,': 3,
  'u': 85,
  'tel': 1,
  'them..:-P': 1,
  'Thanx': 2,
  '4': 27,
  'e': 15,
  'brownie': 1,
  "it's": 7,
  'v': 2,
  'nice...
```
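
The same per-class word counts can be built more compactly with `collections.Counter`. This is a sketch on a hypothetical four-message corpus, not the SMS dataset.

```python
from collections import Counter

# Hypothetical (label, text) training pairs
docs = [
    ("ham", "ok see u later"),
    ("ham", "u there"),
    ("spam", "WIN a prize now"),
    ("spam", "claim prize now"),
]

class_word_freq = {}
for label, text in docs:
    # Counter.update adds one count per whitespace-separated token
    class_word_freq.setdefault(label, Counter()).update(text.split())

print(class_word_freq["spam"].most_common(2))
```

A `Counter` returns 0 for words it has never seen, which is convenient when looking up words that appear in only one class.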

### 3: Get the vocabulary size (number of unique words in the training corpus)

```Python
vocabulary = set()
for text in train_df['text']:
    for word in text.split():
        vocabulary.add(word)
V = len(vocabulary)

V
5904
```
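
Note that `V` counts unique words, not total word occurrences. The short sketch below shows the difference on a hypothetical three-message corpus.

```python
# Hypothetical corpus of three short messages
corpus = ["free prize now", "call me now", "free free call"]

tokens = [word for text in corpus for word in text.split()]
vocabulary = set(tokens)

# Total number of tokens versus number of unique words
print(len(tokens), len(vocabulary))
```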

### 4: Implement Naive Bayes algorithm

```Python
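import numpy as np

# Create a bag-of-words function
def bag_it(doc):
    bag = {}
    for word in doc.split():
        bag[word] = bag.get(word, 0) + 1
    return bag

def classify_doc(doc, class_word_freq, p_classes, V, return_posteriors=False):
    bag = bag_it(doc)
    classes = []
    posteriors = []
    for class_ in class_word_freq.keys():
        # Start from the log prior; summing logs avoids numerical underflow
        p = np.log(p_classes[class_])
        # Total number of word occurrences observed in this class
        class_total = sum(class_word_freq[class_].values())
        for word in bag.keys():
            # Laplace (add-one) smoothed estimate of P(word | class)
            num = class_word_freq[class_].get(word, 0) + 1
            denom = class_total + V
            # Weight by how many times the word occurs in the document
            p += bag[word] * np.log(num/denom)
        classes.append(class_)
        posteriors.append(p)
    if return_posteriors:
        print(posteriors)
    return classes[np.argmax(posteriors)]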
```
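
As a small self-contained illustration of multinomial Naive Bayes with add-one smoothing, the sketch below scores one document against two classes. It assumes the usual smoothed estimate $P(w|c) = \frac{\text{count}(w, c) + 1}{N_c + V}$, where $N_c$ is the total number of word occurrences in class $c$. The word counts, priors, and the helper name `log_posterior` are hypothetical.

```python
import math

# Hypothetical per-class word counts and class priors
class_word_freq = {
    "ham": {"see": 2, "u": 3, "later": 1},
    "spam": {"win": 2, "prize": 3, "now": 1},
}
p_classes = {"ham": 0.5, "spam": 0.5}
V = len({w for counts in class_word_freq.values() for w in counts})

def log_posterior(doc, class_):
    """Unnormalized log posterior of class_ given doc, with add-one smoothing."""
    counts = class_word_freq[class_]
    n_class = sum(counts.values())
    score = math.log(p_classes[class_])
    for word in doc.split():
        # Each occurrence of a word contributes one smoothed log-likelihood term
        score += math.log((counts.get(word, 0) + 1) / (n_class + V))
    return score

doc = "win prize u"
scores = {c: log_posterior(doc, c) for c in class_word_freq}
print(max(scores, key=scores.get))
```

The smoothing keeps a word that was never seen in a class from driving that class's likelihood to zero.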

### 5: Evaluate classifier performance

```Python
y_hat_train = X_train.map(lambda x: classify_doc(x, class_word_freq, p_classes, V))
residuals = y_train == y_hat_train
residuals.value_counts(normalize=True)
```
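
For spam filtering, accuracy alone can hide which kind of error the classifier makes. The sketch below computes precision and recall for the `spam` class on invented labels; the helper name `precision_recall` is hypothetical.

```python
def precision_recall(y_true, y_pred, positive="spam"):
    """Precision and recall for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical true labels and predictions
y_true = ["spam", "ham", "spam", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "spam", "spam"]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

The same check applied to `X_test` and `y_test` would show how well the classifier generalizes beyond the training messages.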