<a href="https://colab.research.google.com/github/dornercr/INFO371/blob/main/INFO371_week8_Probabilistic_Model_NaiveBayes_allMarkdown.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# INFO 371: Data Mining Applications

## Week 8: Probabilistic Models and Naive Bayes
### Prof. Charles Dorner, EdD (Candidate)
### College of Computing and Informatics, Drexel University

# A probabilistic classifier:
- Takes an observation as input
- Predicts a probability distribution over a set of classes
- Rather than outputting only the single most likely class for that observation.

## For example,
- Given an Email as a text document:
 - $Pr(Spam|Email)$ = 0.7
 - $Pr(Not\_Spam|Email)$ = 0.3
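
As a minimal sketch of the difference, the snippet below stores the two probabilities from the email example in a dictionary (the name `probs` is only illustrative) and shows that a hard label is simply the most probable entry of the distribution.

```python
# Class probabilities from the email example above.
probs = {"Spam": 0.7, "Not_Spam": 0.3}

# A probabilistic classifier reports the whole distribution,
# which should sum to 1 across the classes.
total = sum(probs.values())

# A hard classifier would report only the most likely class.
hard_label = max(probs, key=probs.get)

print(probs, hard_label)
```

Keeping the full distribution lets a downstream step choose its own decision threshold instead of always taking the top class.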


# Bayes' Theorem in the Context of Data Mining

\begin{equation}
P(H \mid D) = \frac{P(D \mid H) P(H)}{P(D)}
\end{equation}

- $P(H \mid D)$: Posterior Probability: The probability of hypothesis $H$ (e.g., a model or pattern being true) given the observed data $D$.

- $P(D \mid H)$: Likelihood: The probability of the data occurring given that hypothesis $H$ is true. In data mining, this represents how well the data supports a specific model.

- $P(H)$: Prior Probability: The initial belief about hypothesis $H$ before observing the data. In data mining, this may come from domain knowledge or historical patterns.

- $P(D)$: Evidence (Marginal Probability of Data): The overall probability of observing the data, regardless of which hypothesis is true. This acts as a normalizing factor.
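
To make the four terms concrete, here is a small worked sketch. The numbers are hypothetical and chosen only for illustration: $H$ is "the message is spam" and $D$ is "the message contains the word free". The helper name `posterior` is likewise an assumption, not part of any library.

```python
def posterior(prior_h, likelihood_h, likelihood_not_h):
    """Return P(H | D) for a hypothesis H and its complement."""
    # Evidence P(D) via the law of total probability over H and not-H.
    evidence = likelihood_h * prior_h + likelihood_not_h * (1 - prior_h)
    return likelihood_h * prior_h / evidence

# Hypothetical inputs: P(H) = 0.2, P(D | H) = 0.5, P(D | not H) = 0.05.
p_spam_given_free = posterior(prior_h=0.2, likelihood_h=0.5, likelihood_not_h=0.05)
print(round(p_spam_given_free, 3))
```

With these assumed inputs, observing the data raises the belief in $H$ from the prior of 0.2 to a posterior of about 0.71.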


# Probability Rules for Two Events $A$ and $B$

## 1. Mutually Exclusive Events
Two events $A$ and $B$ are **mutually exclusive** if they cannot occur together. That is:

$$
P(A \cap B) = 0
$$

Using the addition rule:

$$
P(A \cup B) = P(A) + P(B)
$$

## 2. Not Mutually Exclusive Events
If $A$ and $B$ are **not mutually exclusive**, they can occur together. The general addition rule applies:

$$
P(A \cup B) = P(A) + P(B) - P(A \cap B)
$$

## 3. Independent Events
Two events $A$ and $B$ are **independent** if the occurrence of one does not affect the probability of the other. This means:

$$
P(A \cap B) = P(A) P(B)
$$

## 4. Dependent Events
If $A$ and $B$ are **dependent**, the probability of one event depends on whether the other occurs. The joint probability is then given by the multiplication rule, which uses conditional probability:

$$
P(A \cap B) = P(A \mid B) P(B) = P(B \mid A) P(A)
$$
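
The four rules can be checked on a tiny sample space. The sketch below assumes one roll of a fair six-sided die and uses exact fractions; the event names are illustrative choices.

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die.
outcomes = set(range(1, 7))

def prob(event):
    """Probability of an event (a set of outcomes) when all are equally likely."""
    return Fraction(len(event & outcomes), len(outcomes))

even = {2, 4, 6}
odd = {1, 3, 5}
at_most_two = {1, 2}
at_most_three = {1, 2, 3}

# 1. Mutually exclusive: even and odd cannot occur together.
assert prob(even & odd) == 0
assert prob(even | odd) == prob(even) + prob(odd)

# 2. Not mutually exclusive: general addition rule.
assert prob(even | at_most_two) == (
    prob(even) + prob(at_most_two) - prob(even & at_most_two)
)

# 3. Independent: P(A and B) = P(A) P(B), here 1/6 = 1/2 * 1/3.
assert prob(even & at_most_two) == prob(even) * prob(at_most_two)

# 4. Dependent: P(A and B) = P(A | B) P(B), with P(A | B) counted inside B.
p_even_given_b = Fraction(len(even & at_most_three), len(at_most_three))
assert prob(even & at_most_three) == p_even_given_b * prob(at_most_three)
assert prob(even & at_most_three) != prob(even) * prob(at_most_three)
```

Note that "even" is independent of "at most two" but dependent on "at most three", so independence is a property of a specific pair of events.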


# An Example
- Sentiment analysis: classify whether a review is positive or negative
 - “The author is making big money” (positive)
 - “Irony but fascinating” (positive)
 - “don’t waste money on it” (negative)

- Is “money wasted, fascinated” positive or negative?

```
s = {"The author is making big money":1, "Irony but fascinating":1,
"don’t waste money on it":0}
s
```

## Import the libraries for tokenization and normalization

```
import spacy
from collections import Counter
from nltk.stem import PorterStemmer

# Load spaCy's English tokenizer, tagger, parser, and NER
nlp = spacy.load("en_core_web_sm")
```

## Tokenize, normalize, and count word frequencies

```
stemmer = PorterStemmer()

word_freq_pos = Counter()
word_freq_neg = Counter()

for sentence, label in s.items():
    doc = nlp(sentence.lower())  # Convert to lowercase for uniformity
    # Lemmatize, then stem, keeping only alphabetic tokens
    words = [stemmer.stem(token.lemma_.lower()) for token in doc if token.is_alpha]
    if label == 1:
        word_freq_pos.update(words)  # positive review
    else:
        word_freq_neg.update(words)  # negative review
```

```
word_freq_pos, word_freq_neg
```

## Conditional Word Probabilities
- Compute the conditional probability of each word given a class
- Given a word $w$,
 \begin{equation}
 Pr(w| pos) = \frac{frequency\ of\ w\ in\ all\ positive\ cases}{total\ number\ of\ words\ in\ all\ positive\ cases}
 \end{equation}

 \begin{equation}
 Pr(w| neg) = \frac{frequency\ of\ w\ in\ all\ negative\ cases}{total\ number\ of\ words\ in\ all\ negative\ cases}
 \end{equation}

```
word_prob_pos = {}
word_prob_neg = {}
```

```
for word, count in word_freq_pos.items():
    word_prob_pos[word] = count / sum(word_freq_pos.values())
for word, count in word_freq_neg.items():
    word_prob_neg[word] = count / sum(word_freq_neg.values())
```

```
word_prob_pos, word_prob_neg
```

## Naive Bayes Classification
- Tokenize and normalize the given sentence instance $s$.
- Compute the conditional probabilities, $Pr(pos|s)$ and $Pr(neg|s)$.
- Apply Bayes' Theorem:
\begin{equation}
    Pr(pos|s) = \frac{Pr(s|pos)\times Pr(pos)}{Pr(s)}
\end{equation}
\begin{equation}
    Pr(neg|s) = \frac{Pr(s|neg)\times Pr(neg)}{Pr(s)}
\end{equation}
- Apply the word independence assumption:
\begin{equation}
    Pr(s|pos) = Pr(w_{1}|pos)\times Pr(w_{2}|pos)\times...\times Pr(w_{m}|pos)
\end{equation}
\begin{equation}
    Pr(s|neg) = Pr(w_{1}|neg)\times Pr(w_{2}|neg)\times...\times Pr(w_{m}|neg)
\end{equation}

- Make the comparison. $Pr(s)$ is the same for both classes, so it can be ignored; if the two class priors are also treated as equal (a simplification), the posteriors are proportional to the likelihoods:
\begin{equation}
    Pr(pos|s) \propto Pr(s|pos)
\end{equation}
\begin{equation}
    Pr(neg|s) \propto Pr(s|neg)
\end{equation}
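
The steps above can be sketched end to end in plain Python. This is an illustration under stated assumptions, not the notebook's own pipeline: the training tokens are hand-written stand-ins for normalized words (not actual spaCy or stemmer output), stop words are left out, and the class priors $Pr(pos)$ and $Pr(neg)$ are kept in the numerator instead of being treated as equal.

```python
from collections import Counter
from math import prod

# Hypothetical, already-normalized training tokens (assumed, not spaCy output).
train = [
    (["author", "make", "big", "money"], "pos"),
    (["ironi", "fascin"], "pos"),
    (["wast", "money"], "neg"),
]

# Word counts per class and class priors Pr(pos), Pr(neg).
counts = {"pos": Counter(), "neg": Counter()}
n_docs = Counter()
for tokens, label in train:
    counts[label].update(tokens)
    n_docs[label] += 1
priors = {c: n_docs[c] / len(train) for c in counts}

def likelihood(tokens, c):
    """Pr(s | c) under the word-independence assumption.

    Words never seen in class c are skipped instead of contributing a
    zero factor; smoothing would be a common alternative.
    """
    total = sum(counts[c].values())
    return prod(counts[c][w] / total for w in tokens if w in counts[c])

def classify(tokens):
    """Return the normalized posteriors Pr(c | s) for both classes."""
    scores = {c: likelihood(tokens, c) * priors[c] for c in counts}
    z = sum(scores.values())  # plays the role of Pr(s)
    return {c: score / z for c, score in scores.items()}

posteriors = classify(["money", "wast", "fascin"])
print(posteriors)
```

With these assumed tokens the negative class gets the larger posterior, mainly because "wast" and "money" each account for half of the negative word count.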

```
ss = "money wasted, fascinated"
doc = nlp(ss.lower())
words = [stemmer.stem(token.lemma_.lower()) for token in doc if token.is_alpha]
words
```

```
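Pr_1 = 1  # running product for Pr(s | pos)
for word in words:
    # Words never seen in positive reviews are skipped rather than multiplying by 0
    if word in word_prob_pos:
        Pr_1 *= word_prob_pos[word]
Pr_1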
```

```
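Pr_0 = 1  # running product for Pr(s | neg)
for word in words:
    # Words never seen in negative reviews are skipped rather than multiplying by 0
    if word in word_prob_neg:
        Pr_0 *= word_prob_neg[word]
Pr_0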
```

```
# normalize the probabilities
Pr_1_norm = Pr_1 / (Pr_1 + Pr_0)
Pr_0_norm = Pr_0 / (Pr_1 + Pr_0)
Pr_1_norm, Pr_0_norm
```

## Retrieval Practice

# Use Scikit Learn Naive Bayes Classifier

```
import pandas as pd
import numpy as np
from google.colab import files
import matplotlib.pyplot as plt
```

## Upload and read the text data

```
files.upload()
```

```
sms = pd.read_csv("spam.csv", encoding='latin-1')
sms.head()
```

```
sms.shape
```

### Label Distribution

```
sms.v1.value_counts()/sms.shape[0]
```

## Create a tokenizer using spacy

```
# Create our tokenizer function
def spacy_tokenizer(sentence):
    """Accept a sentence and return its tokens after removing stop words
    and punctuation, lemmatizing, and lowercasing."""

    # Parse the sentence into a spaCy Doc with linguistic annotations
    doc = nlp(sentence)

    # Remove stop words and punctuation
    mytokens = [word for word in doc if not word.is_stop and word.pos_ != 'PUNCT']

    # Lemmatize and lowercase each token; pronouns keep their surface form
    mytokens = [word.lemma_.lower().strip() if word.pos_ != "PRON" else word.text.lower() for word in mytokens]

    # Return the preprocessed list of tokens
    return mytokens
```

```
spacy_tokenizer(sms.loc[4].v2)
```

## Vectorization
- We will convert the labels to 1 or 0 such that spam=1 and ham=0.
- We are going to use Bag of Words (BoW) to convert text into a numeric format.
- BoW converts text into a matrix of word occurrences within each document. Each row is a document, each column is a word, and each entry counts how many times that word occurs in that document. The result is called the BoW matrix or Document-Term Matrix.
- We are going to use sklearn's CountVectorizer to generate the BoW matrix.
- In CountVectorizer we will use the custom tokenizer 'spacy_tokenizer' and the ngram range, which defines the combinations of adjacent words to use. A unigram is a single word and a bigram is a sequence of 2 consecutive words.
- Likewise, an n-gram is a sequence of n consecutive words.
- In this example we are going to use unigrams, so the lower and upper bounds of the ngram range will be (1,1).
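
To show what a BoW matrix and an n-gram look like, here is a plain-Python sketch on two made-up messages. It assumes simple whitespace tokenization and is only an illustration of the idea, not how CountVectorizer is implemented; the helper names `ngrams` and `bag_of_words` are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(docs, n=1):
    """Build a document-term count matrix from whitespace-tokenized docs."""
    tokenized = [ngrams(doc.lower().split(), n) for doc in docs]
    # A sorted vocabulary gives each term a fixed column index.
    vocab = sorted({term for terms in tokenized for term in terms})
    matrix = []
    for terms in tokenized:
        term_counts = Counter(terms)
        matrix.append([term_counts[term] for term in vocab])
    return vocab, matrix

# Two made-up messages.
docs = ["free prize free entry", "call me for free"]
vocab, matrix = bag_of_words(docs, n=1)      # unigrams
bigram_vocab, _ = bag_of_words(docs, n=2)    # bigrams

print(vocab)         # ['call', 'entry', 'for', 'free', 'me', 'prize']
print(matrix)        # [[0, 1, 0, 2, 0, 1], [1, 0, 1, 1, 1, 0]]
print(bigram_vocab)
```

The entry 2 in the first row records that "free" appears twice in the first message, which is the count information a multinomial model uses.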

```
from sklearn.feature_extraction.text import CountVectorizer
```

```
bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range = (1,1))
```

```
# Convert all text into vectors
X = bow_vector.fit_transform(sms.v2)
```

```
X.shape
```

```
# Densify only the first two rows instead of the whole sparse matrix
X[:2].toarray()
```

```
bow_vector.vocabulary_
```

```
# Convert class label to numeric 1 or 0
y = sms.v1.map({'spam':1, 'ham':0})
y
```

## Split data into training and test sets
- We will use sklearn train_test_split to create training and test sets
- We will use 80% of the data as the training set and the remaining 20% as the test set

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
```

## Let us build a Naive Bayes Classifier

```
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
cls = MultinomialNB()
```

```
scores = cross_val_score(cls, X_train, y_train, scoring='accuracy')
scores
```

```
np.mean(scores)
```

## Test the classifier

```
cls.fit(X_train, y_train)
```

```
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
preds = cls.predict(X_test)
# sklearn metrics expect the true labels first: metric(y_true, y_pred)
print("Precision: {}".format(precision_score(y_test, preds)))
print("Recall: {}".format(recall_score(y_test, preds)))
print("F1-Measure: {}".format(f1_score(y_test, preds)))
print("Accuracy: {}".format(accuracy_score(y_test, preds)))
```
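
For reference, the four metrics can be computed by hand from the counts of true and false positives and negatives. The sketch below uses hypothetical labels (1 = spam, 0 = ham) and an illustrative helper, `binary_metrics`, that is not part of scikit-learn.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1 and accuracy for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham passed through
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, f1, accuracy

# Hypothetical labels: three spam and five ham messages.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]
precision, recall, f1, accuracy = binary_metrics(y_true, y_pred)
print(precision, recall, f1, accuracy)
```

Precision and recall trade places if the two arguments are exchanged, so the order of true and predicted labels matters; F1 and accuracy are unaffected by the swap.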

## Retrieval Practice

```

```