# ***Naive Bayes Classifier***
------------------------------------

In [2]:
# Suppose you have three classes to classify web search queries into

# Zoology
# Computer Science
# Entertainment

In [3]:
# In general most web search queries fall under the entertainment category.
# When you know nothing about a query, the chance of it being entertainment related is quite high.

In [4]:
# Consider the search query "Python"

# Is it Python the snake? (Zoology)
# Python programming language (Comp Sci)
# Or the Monty Python show (entertainment)

# How would you classify this query?


In [5]:
# Without any information about a query, we would have assumed it to be entertainment related, simply because that is the most common category.
# Now that we know the query is "Python", it is more likely to be related to the Python snake than to entertainment.

In [7]:
# Assume that the search query has changed to "Python 3.8 download"

# We can now be almost certain that this is related to the Python programming language.
# Given additional information about the context, one can better classify the query into a more appropriate category,
# i.e. Computer Science

In [8]:
# We initially had a probabilistic model
# That stated that most web search queries are likely to be entertainment related.
# But when we became aware of the query string, the context changed.
# We raised the probability of the query being Zoology related, since the string was "Python".

# Then with the arrival of more information,
# such as the version number and "download", we knew that this query was about the Python language.
# So, we updated the probabilities again,
# this time giving the Computer Science class the highest probability.

## ***Prior probability: P(Y = Entertainment), P(Y = Zoology), P(Y = Computer science)***

## ***Prior probability: P(Y = Entertainment) > P(Y = Zoology) and P(Y = Entertainment) > P(Y = Computer science)***

In [None]:
# These are the probabilities before having any clue about the actual query string.
# They apply to any random search query and are based purely on past knowledge learned from data.

## ***P(Y = Entertainment) + P(Y = Zoology) + P(Y = Computer science) = 1***

In [9]:
# Since we have just these three classes, the sum of these probabilities should be 1.0
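
A quick check of this constraint can be sketched as follows. The prior values are made-up numbers chosen only for illustration, not measured query frequencies.

```python
import math

# Hypothetical priors, chosen only for illustration.
priors = {
    "Entertainment": 0.6,
    "Zoology": 0.1,
    "Computer Science": 0.3,
}

# The three classes cover every query and do not overlap,
# so their prior probabilities must sum to 1.
total = sum(priors.values())
print(f"Sum of priors: {total:.1f}")

# With no information about the query, the best guess is the class
# with the largest prior.
best_guess = max(priors, key=priors.get)
print(f"Best guess without seeing the query: {best_guess}")
```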

In [10]:
# Say, now that we know that the query is "Python"

## ***P(Y = Entertainment | x = "Python")***

In [13]:
# Now this has become the probability of the query string being entertainment related, given the query string is "Python"

In [14]:
# So, now

## ***P(Y = Entertainment | x = "Python") < P(Y = Comp. Sci | x = "Python") and P(Y = Zoology | x = "Python") < P(Y = Comp. Sci | x = "Python")***

# ***Bayes' theorem***

## ***Posterior probability = $\frac{\text{Prior probability} \times \text{Likelihood}}{\text{Evidence}}$***

# ***$P(Y | x)$ = $\frac{P(Y) \times P(x | Y)}{P(x)}$***

In [2]:
# The probability of Y given x, P(Y | x), is equal to
# the prior probability of Y, P(Y), multiplied by
# the likelihood, i.e. the probability of x given Y, P(x | Y),
# where x is the data (the query) and Y is the class,
# e.g. the probability of seeing "Python" given that the label is Computer Science,

# divided by the probability of x, P(x).

## ***$P(Y = \text{Comp. Sci.} | x = \text{"Python"}) = \frac{P(Y = \text{Comp. Sci.}) \times P(x = \text{"Python"} | Y = \text{Comp. Sci.})}{P(x = \text{"Python"})}$***

In [21]:
# So the probability of the class being Computer Science, given that the query string is "Python", equals
# the prior probability of the class Computer Science, multiplied by
# the probability of the query "Python" given the class Computer Science (how likely "Python" is among Computer Science queries),
# divided by the probability of getting the query string "Python" at all.
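
To make the formula concrete, here is a minimal worked sketch for the Computer Science class. All three input numbers are assumptions picked for illustration; none of them come from real query data.

```python
# Hypothetical inputs, for illustration only.
p_cs = 0.3                 # prior P(Y = Comp. Sci.)
p_python_given_cs = 0.10   # likelihood P(x = "Python" | Y = Comp. Sci.)
p_python = 0.046           # evidence P(x = "Python")

# Bayes' theorem: posterior = prior * likelihood / evidence
posterior_cs = p_cs * p_python_given_cs / p_python

print(f"P(Y = Comp. Sci. | x = 'Python') = {posterior_cs:.3f}")
```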

In [22]:
# However, the Naive Bayes classification model is only interested in which of the three labels is most likely.
# Which class has the highest probability is what matters;
# the exact probability values are less important.

In [23]:
# So the predicted label Y_hat is the Y that maximizes P(Y | x)
# e.g. for P(Y | x = "Python 3.9.11"),
# Y_hat is the label (class) Comp. Sci., since it maximizes P(Y | x).

# ***$P(Y | x) = \frac{P(Y)~\cdot~P(x | Y)}{P(x)}$***

In [24]:
# Here, P(x) does not matter.
# The probability of seeing a query like "Python" is the same whichever class we evaluate, so it scales every class's score equally and cannot change which class wins.
# One can safely ignore the P(x) part when taking the argmax.

# ***$\hat{Y} = \text{argmax}_{Y}~P(Y | x) = \text{argmax}_{Y}~P(Y) \times P(x | Y)$***
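
The following sketch checks that leaving out P(x) does not change the predicted class. The priors and likelihoods are hypothetical values used only to illustrate the point.

```python
# Hypothetical priors P(Y) and likelihoods P(x = "Python" | Y), for illustration only.
priors = {"Entertainment": 0.6, "Zoology": 0.1, "Computer Science": 0.3}
likelihoods = {"Entertainment": 0.01, "Zoology": 0.10, "Computer Science": 0.10}

# Unnormalised scores P(Y) * P(x | Y); P(x) is deliberately left out.
scores = {y: priors[y] * likelihoods[y] for y in priors}

# Full posteriors: every score is divided by the same evidence P(x).
evidence = sum(scores.values())
posteriors = {y: s / evidence for y, s in scores.items()}

# Both rankings pick the same class, because dividing by a positive
# constant never changes which value is the largest.
y_hat_from_scores = max(scores, key=scores.get)
y_hat_from_posteriors = max(posteriors, key=posteriors.get)
print(y_hat_from_scores, y_hat_from_posteriors)
```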

In [25]:
# Naive assumption ->

# Given the class label, features are assumed to be independent of one another.
# e.g. given the label Y, P(x_i | Y) depends only on the feature x_i and not on any other feature x_j.

In [26]:
# Final formulation of Naive Bayes classifier
# x_i here enumerates the features in the web search query.

# ***$\hat{Y} = \text{argmax}_{Y}~P(Y | x) = \text{argmax}_{Y}~P(Y) \times \prod_{i = 1}^{n}{P(x_{i} | Y)}$***

In [27]:
# Predicted label Y_hat is the label (class) that maximizes the probability of label Y given feature x. P(Y | x)

In [28]:
# If the web query is "Python 3.10 Windows 11"

# ***$\hat{Y} = \text{argmax}_{Y}~P(Y) \times P(\text{"Python"} | Y) \times P(\text{"3.10"} | Y) \times P(\text{"Windows 11"} | Y)$***
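
As a rough sketch of this product, the snippet below scores each class for the query features "Python", "3.10" and "Windows 11". The priors and the per-feature likelihood table are invented for illustration, and the helper name `score` is hypothetical.

```python
import math

# Hypothetical priors and per-feature likelihoods P(x_i | Y), for illustration only.
priors = {"Entertainment": 0.6, "Zoology": 0.1, "Computer Science": 0.3}
likelihoods = {
    "Entertainment":    {"Python": 0.01, "3.10": 0.001,  "Windows 11": 0.005},
    "Zoology":          {"Python": 0.10, "3.10": 0.0001, "Windows 11": 0.0001},
    "Computer Science": {"Python": 0.10, "3.10": 0.02,   "Windows 11": 0.05},
}

features = ["Python", "3.10", "Windows 11"]


def score(label):
    """Unnormalised score P(Y) * prod_i P(x_i | Y) under the naive assumption."""
    return priors[label] * math.prod(likelihoods[label][f] for f in features)


scores = {label: score(label) for label in priors}
y_hat = max(scores, key=scores.get)
print(f"Predicted class: {y_hat}")
```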

In [32]:
# The probability of this query being Zoology related is very low.


In [33]:
# The naive Bayes classifier is only interested in which class has the highest probability.
# Y will belong to one of the following classes: Zoology, Computer Science, Entertainment
# The exact probability of the query string itself has very little importance.
# i.e. P("Python") has no significance, since it is the same for every class.
# P("Python") is the probability of seeing the string "Python" in a random web search query.


So, from the following equation, we can remove the red part!

# ***$P(Y | x) = \frac{P(Y)~\cdot~P(x | Y)}{\color{red}{P(x)}}$***

and write it as,

# ***$\hat{Y} = \text{argmax}_{Y}~P(Y | x) = \text{argmax}_{Y}~P(Y) \times P(x | Y)$***

In [34]:
# The Naive Bayes classifier assumes that there are no relationships between features.
# e.g. in the query "Python 3.9.11 Windows 11"
# the probabilities of the individual words occurring are treated as independent, given the class.
# This is not strictly true, since the version string and platform are more likely to co-occur with "Python"
# than to occur on their own.

# ***$\hat{Y} = \text{argmax}_{Y}~P(Y | x) = \text{argmax}_{Y}~P(Y) \times \prod_{i = 1}^{n}{P(x_{i} | Y)}$***

## $\hat{Y}$ - The value of Y that maximizes $P(Y | x)$
## Which is computed as the prior $P(Y)$ times the product of the probabilities of the individual features given Y.

In [35]:
# Consider the query "Python 3.11.2 download"

# ***$\hat{Y} = \text{argmax}_{Y}~P(Y) \times P(\text{Python} | Y) \times P(\text{3.11.2} | Y) \times P(\text{download} | Y)$***

In [36]:
# If Y is Zoology
# P(Zoology) is typically low
# P(Python | Zoology) will be high
# P(3.11.2 | Zoology) will be very low
# P(download | Zoology) will also be low.

# If Y is Computer Science
# P(Computer Science) will be low
# P(Python | Computer Science) will be high.
# P(3.11.2 | Computer Science) will be high.
# P(download | Computer Science) will also be high.
# So, of all classes, Computer Science will be the argmax of P(Y) x P(Python | Y) x P(3.11.2 | Y) x P(download | Y)
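
The reasoning above can be sketched in code. The numbers are hypothetical stand-ins for "low" and "high", and the helper name `log_score` is made up. Summing logarithms instead of multiplying probabilities is a choice made here and not something the text prescribes; it gives the same argmax because the logarithm is monotonic.

```python
import math

# Hypothetical values that mirror the qualitative "low" / "high" reasoning above.
priors = {"Entertainment": 0.8, "Zoology": 0.1, "Computer Science": 0.1}
likelihoods = {
    "Entertainment":    {"Python": 0.01, "3.11.2": 0.001,  "download": 0.05},
    "Zoology":          {"Python": 0.20, "3.11.2": 0.0001, "download": 0.001},
    "Computer Science": {"Python": 0.20, "3.11.2": 0.01,   "download": 0.10},
}


def log_score(label, tokens):
    """log P(Y) + sum_i log P(x_i | Y): the log of the naive Bayes score."""
    return math.log(priors[label]) + sum(
        math.log(likelihoods[label][token]) for token in tokens
    )


tokens = "Python 3.11.2 download".split()
log_scores = {label: log_score(label, tokens) for label in priors}
y_hat = max(log_scores, key=log_scores.get)

for label, s in log_scores.items():
    print(f"{label}: log-score = {s:.2f}")
print(f"Predicted class: {y_hat}")
```

Even though the Computer Science prior is low in this sketch, its high likelihoods for all three tokens outweigh the larger Entertainment prior.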

# ***Parameters***
--------------------

In [37]:
# In the Naive Bayes model, Y_hat is the label that maximizes the product of the prior and the individual feature likelihoods.
# What are the parameters there?

## ***Prior probabilities: $P(y)$ for $y$ in $Y$***
## ***Likelihoods: $P(x_{i} | y)$ for all features $x_{i}$ and labels $y$ in $Y$***

In [38]:
# If there are 5 classes
# |Y| = 5

# and 250 features
# |x| = 250
# e.g. x goes from x_0, x_1, x_2 .... x_249

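# How many parameters will a Naive Bayes model have?
# Assuming one likelihood P(x_i | y) per feature and class, plus one prior P(y) per class:

print(f"Number of likelihood parameters: {5 * 250}")
print(f"Number of prior parameters: {5}")
print(f"Total number of parameters: {5 * 250 + 5}")

Number of likelihood parameters: 1250
Number of prior parameters: 5
Total number of parameters: 1255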


# ***Learning parameters***
-----------------------

In [48]:
# Let's say that you have the following 5 labels
# Movies, Music, Zoology, Computer Science, Data Science

# How do you establish that Movies or Music related queries have higher prior probabilities?

# You will have a training dataset
# with a set of web queries labelled with their corresponding classes

# e.g. a query "Rust stable release" labelled "Computer Science"
# "Jennifer Lopez - On The Floor ft. Pitbull" labelled "Music"
# "Anatomy of reptilian brain" labelled "Zoology"
# "Marry the Night - Lady Gaga" labelled "Music"
# "Fast and Furious 6" labelled "Movies"
# etc.

# We can simply count the number of instances of each class in the training data and divide by the total number of queries to compute the classes' prior probabilities.
# Assume that the above examples are your training data (just 5 queries with their labels)

print(f"Prior probability of class {'Music'} is {2 / 5}")
print(f"Prior probability of class {'Computer Science'} is {1 / 5}")
print(f"Prior probability of class {'Movies'} is {1 / 5}")
print(f"Prior probability of class {'Zoology'} is {1 / 5}")

Prior probability of class Music is 0.4
Prior probability of class Computer Science is 0.2
Prior probability of class Movies is 0.2
Prior probability of class Zoology is 0.2
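
The same counting can be sketched programmatically, which scales beyond five hand-counted examples. The variable names below are illustrative. The snippet uses the five labelled queries listed above and `collections.Counter` from the standard library.

```python
from collections import Counter

# The five labelled training queries from the example above.
training_data = [
    ("Rust stable release", "Computer Science"),
    ("Jennifer Lopez - On The Floor ft. Pitbull", "Music"),
    ("Anatomy of reptilian brain", "Zoology"),
    ("Marry the Night - Lady Gaga", "Music"),
    ("Fast and Furious 6", "Movies"),
]
labels = ["Movies", "Music", "Zoology", "Computer Science", "Data Science"]

# Count how many training queries carry each label.
label_counts = Counter(label for _, label in training_data)
n_queries = len(training_data)

# Prior = fraction of training queries with that label.
# Counter returns 0 for labels it has never seen.
priors = {label: label_counts[label] / n_queries for label in labels}

for label, p in priors.items():
    print(f"Prior probability of class {label} is {p}")
```

Note that "Data Science" gets a prior of 0 here, simply because none of the five training queries carries that label.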
