**Section 6: Naive Bayes**

Notebook for "Introduction to Data Science and Machine Learning"

version 1.0, May 28 2024


`import` statements required for this notebook.

In [None]:
import numpy as np
import pandas as pd

# 1. Naive Bayes Classifier: First Steps

In this notebook we solve the exercise from the Self Study Assignment. Given is the following table of observations:

| A | B | C | Class |
| --- | --- | --- | ------ |
| 0 | 0 | 0 | + |
| 0 | 0 | 1 | - |
| 0 | 1 | 1 | - |
| 0 | 1 | 1 | - |
| 0 | 0 | 1 | + |
| 1 | 0 | 1 | + |
| 1 | 0 | 1 | - |
| 1 | 0 | 1 | - |
| 1 | 1 | 1 | + |
| 1 | 0 | 1 | + |




For this exercise we replaced class `+` with `1` and class `-` with `0` and stored the table in a `.csv` file named `exerciseBayes.csv`.

In [None]:
df=pd.read_csv('data/exerciseBayes.csv')
print(df)

Now let's create a numpy array that only contains the data:

In [None]:
data=np.array(df)
print(data)

For the Bayes classifier, we need to count occurrences and determine probabilities. We start with the total number of rows, i.e. observations:

In [None]:
totalNb=data.shape[0]
print('totalNb:',totalNb)

Occurrences of class 1 and class 0:

In [None]:
nbClass1=(data[:,3]==1).sum()
print('nbClass1:',nbClass1)
nbClass0=(data[:,3]==0).sum()
print('nbClass0:',nbClass0)

And we determine the probabilities for the two classes:

In [None]:
probClass1=nbClass1/totalNb
print('probClass1:',probClass1)
probClass0=nbClass0/totalNb
print('probClass0:',probClass0)

And now we need to determine the conditional probabilities. There are two issues that might make this task a bit tedious:
1. We need many Python variables (three attributes with two values each for 2 classes ==> 12 different conditional probabilities!) or need to be careful not to confuse values when we use a list.
2. While we can use one condition to determine index vectors (as in `(data[:,3]==1).sum()`), we have not yet seen how to combine more than one condition. Therefore, we would need reduction. Alternatively, we could go through all the observations to count occurrences, but this is confusing to program with the different variables.

The Python concepts we know so far thus do not support our task very well.

We will now introduce a new data structure, the `dictionary`, that helps us count the occurrences and determine the required probabilities.

# 2. Data Types: The Dictionary `dict`

A **dictionary**, or **map** as it is called in some programming languages, stores **key-value pairs**.
The keys must be unique.

## 2.1 Definition of a Dictionary

An empty dictionary is defined using empty curly braces `{}` or `dict()`.

In [None]:
d1={}
d2=dict()

print(type(d1))
print(type(d2))

When defining a non-empty dictionary, the key-value pairs are specified as `k:v`, where `k` is the key and `v` is the value.

In [None]:
d3={1:2,3:4}
print(d3)
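
Because keys are unique, writing the same key twice in a literal does not create two entries; the later value replaces the earlier one. A minimal sketch (the name `d4` is only illustrative):

```python
# the key 1 appears twice; only the last value is kept
d4 = {1: 'first', 3: 4, 1: 'second'}
print(d4)        # {1: 'second', 3: 4}
print(len(d4))   # 2
```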

## 2.2 Access 

Given is the following dictionary:

In [None]:
example={1:12,'s1':13,'text':17,-4:12,3.4:"8"}
print(example)

As you can see, keys as well as values can have different data types. 

The values of the dictionary are accessed via the key and the  `[]` operator:

In [None]:
print('Value at key "s1":',example['s1'])
print('Value at key 1   :',example[1])
print('Value at key 3.4 :',example[0.34e+1])
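
Reading a key that does not exist with `[]` raises a `KeyError`. As an aside that is not used in the rest of this notebook, the built-in method `get()` returns a default value instead. The dictionary `sample` below is a shortened, illustrative copy of `example`:

```python
sample = {1: 12, 's1': 13, 'text': 17}

# [] raises a KeyError for an unknown key
try:
    print(sample['unknown'])
except KeyError as err:
    print('KeyError for key', err)

# get() returns None, or the given default, instead of raising
print(sample.get('unknown'))
print(sample.get('unknown', 0))
print(sample.get('s1', 0))
```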

A dictionary is a mutable data type, i.e. its values can be modified. A value is changed by a simple assignment to its key:

In [None]:
example['s1']=17

In case no value exists for the specified key, a new key-value pair is inserted into the dictionary:

In [None]:
print(example)
example[2]=123
print(example)

The operator `in` tests whether a specified key exists in the dictionary. Applied to the dictionary itself, it does not test for the existence of values:

In [None]:
print(example)
print("in 'text':",'text' in example)
print("in '8':",'8' in example)
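
To test for a value rather than a key, `in` can be applied to the iterable returned by the method `values()`. A small sketch with an illustrative dictionary `sample`:

```python
sample = {'text': 17, 3.4: '8'}

print('8' in sample)            # False: '8' is not a key
print('8' in sample.values())   # True: '8' is a value
print('text' in sample)         # True: 'text' is a key
```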

## 2.3 Printing a Dictionary

A dictionary can be output using a `for` loop. A `for` loop requires a so-called iterable. There are three methods to obtain an iterable from a dictionary: 
1. `items()`: creates an iterable of key-value pairs
2. `keys()`: creates an iterable of keys
3. `values()`: creates an iterable of values

**Output using `keys()`**

`keys()` creates an iterable of keys:

In [None]:
example={1:12,'s1':13,'text':17,-4:12,3.4:"8"}
for k in example.keys():
    print(k)

Using the keys, the values can be printed as well:

In [None]:
example={1:12,'s1':13,'text':17,-4:12,3.4:"8"}
for k in example.keys():
    print(k,'with value',example[k])

**Output using `items()`**

In order to print key-value pairs, the `items()` method can be used:

In [None]:
example={1:12,'s1':13,'text':17,-4:12,3.4:"8"}
for i in example.items():
    print(i)

Each element produced by `items()` is a tuple `(key, value)` that can be unpacked:

In [None]:
example={1:12,'s1':13,'text':17,-4:12,3.4:"8"}
for s,w in example.items():
    print(s,'is the key and',w,'is the value')

**Output using `values()`**

A last method to create an iterable from a dictionary is `values()`.

In [None]:
example={1:12,'s1':13,'text':17,-4:12,3.4:"8"}
for v in example.values():
    print(v)

**Please note:** there is no method to retrieve the key of a value (as different keys might have the same value). If you wish to output key-value pairs, you need to use the method `items()` or `keys()`.

Of course, we can also print a dictionary using a simple `print()` statement.

In [None]:
print(example)
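
Returning to the note above on looking up keys by value: one possible workaround is to loop over `items()` and collect every key whose value matches. A sketch, where `keysFor12` is an illustrative name:

```python
example = {1: 12, 's1': 13, 'text': 17, -4: 12, 3.4: "8"}

# several keys can share a value, so the result is a list
keysFor12 = []
for k, v in example.items():
    if v == 12:
        keysFor12.append(k)

print(keysFor12)   # [1, -4]
```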

## 2.4 Example: Counting the Words in a Text

Let's assume we have a text and want to count the words. To make things easier we assume that there are no punctuation marks and no upper case words.

In [None]:
theText="this is a sample text to demonstrate the use of a dictionary in a notebook we do not use punctuation marks or any specific things in this text for the notebook I hope you will have some fun and can learn something from the notebook"

We first need to create a list of the single words in the text. We can do so by calling the method `split()` of the class `str`:

In [None]:
wordList=theText.split()
print(wordList)

Now we create a dictionary to count the words.

In [None]:
wordCounter={}

We now iterate through the list of words. We test whether the current word is a key of the dictionary using the `in` operator. If not, we create a new entry in the dictionary and set the counter value to 1. Otherwise we increment the counter associated with the word.

In [None]:
for word in wordList:
    if word in wordCounter:
        wordCounter[word]=wordCounter[word]+1
    else:
        wordCounter[word]=1

and output the result:

In [None]:
for word,number in wordCounter.items():
    print(f"'{word}' occurs {number} times")
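
The same counting pattern is also available in the standard library: `collections.Counter` is a dictionary subclass that returns 0 for missing keys. A sketch with a short illustrative text (`shortText` is not part of the notebook):

```python
from collections import Counter

shortText = "the cat and the dog and the bird"
counter = Counter(shortText.split())

print(counter['the'])    # 3
print(counter['and'])    # 2
print(counter['fish'])   # 0, missing keys count as zero
```

The explicit `if`/`else` version above shows what happens under the hood and carries over directly to the nested dictionaries used later.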

# 3. Example from the self study: "Learning" the Classifier

## 3.1 Introductory Example

As a first example, let's simply count the samples belonging to each class and store those numbers in a dictionary. The classes are `1` and `0`, so we define the dictionary as follows:

In [None]:
dic1={1:0,0:0}

The classes are in the fourth column `class` with index 3. We use a for loop to look at all data samples and increment the counter for the class:

In [None]:
for row in data:
    dic1[row[3]]+=1
print (dic1)
    

Now let's expand this example and count the values of attribute A (first column, index 0) for each of the classes. We now use a two-level dictionary:

In [None]:
dic2={1:{0:0,1:0},0:{0:0,1:0}}

The first level indicates the class, the second the value of attribute A.

`dic2[1][0]` contains therefore the number of data samples belonging to class 1 with a value 0 for attribute A.

In [None]:
for row in data:
    dic2[row[3]][row[0]]+=1
print (dic2)
    

This code starts looking a bit cryptic, so we could add another level to the dictionary:

In [None]:
dic3={1:{'A':{0:0,1:0}},0:{'A':{0:0,1:0}}}

In [None]:
for row in data:
    dic3[row[3]]['A'][row[0]]+=1
print (dic3)

We can now enhance this to count the values of all attributes.

## 3.2 Counting the samples

We will now use a three level dictionary to count values:

We use the classes as keys for the dictionary. Our dictionary has two keys, `1` and `0`, representing the classes.
The corresponding values are themselves dictionaries representing the attributes, i.e. the features `A`, `B` and `C`. To count the values of each feature, we create a third level of dictionaries with the keys `1` (the corresponding attribute has the value 1) and `0` (the corresponding attribute has the value 0). The values of these dictionaries are integers, initialized with 0, that count the number of occurrences.

The data structure is defined as follows:

In [None]:
count={1:{'A':{0:0,1:0},'B':{0:0,1:0},'C':{0:0,1:0}},0:{'A':{0:0,1:0},'B':{0:0,1:0},'C':{0:0,1:0}}}

Let's take a look at the data structure and print it:

In [None]:
print(count)
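
Writing the literal by hand gets error-prone as classes or attributes are added. One possible alternative, sketched here with the illustrative names `classList`, `attributeList`, `valueList` and `countBuilt`, builds the same structure with nested loops:

```python
classList = [1, 0]
attributeList = ['A', 'B', 'C']
valueList = [0, 1]

countBuilt = {}
for cl in classList:
    countBuilt[cl] = {}
    for a in attributeList:
        countBuilt[cl][a] = {}
        for v in valueList:
            # every counter starts at 0
            countBuilt[cl][a][v] = 0

print(countBuilt)
```

Each innermost dictionary is created separately, so incrementing one counter does not affect any other.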

We now create a second dictionary to count the occurrences of the classes (we already know that both classes occur 5 times):

In [None]:
classCount={1:0,0:0}

And a variable to count the total number of occurrences (10).

In [None]:
totalNumber=0

Now we define one more variable, a list, to make access to the dictionary (its keys) easier:

In [None]:
attributeNames=['A','B','C']

`A` in `attributeNames` is at index `0` which "happens" to be the index of the column in the array `data`.

Now we can use a `for` loop to look at each occurrence and increment the corresponding numbers in the dictionary for the different attribute values:

In [None]:
# go through the array row by row
for row in data:
    # the class (1 (+) or 0 (-) ) is stored in the fourth column at index 3
    theClass=row[3]
    # now we go through the three columns with the attribute values, i is the 
    # index of the columns
    for i in range(len(attributeNames)): 
        # first let's retrieve the attribute name (index i of the names list)
        theAttribute=attributeNames[i]
        # and the value of the attribute (in index i)
        theValue=row[i]
        # and now we use theClass, theAttribute and theValue as the keys for the 
        # three level dictionary and increment the counter
        count[theClass][theAttribute][theValue]+=1
    # after we finished looking at all attribute values,
    # we finished one occurrence and increment the corresponding counter
    # for the class
    classCount[theClass]+=1
    # and for the total numbers
    totalNumber+=1

Let's print the result:

In [None]:
print(count)
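
For readers who want to check the counts without the `.csv` file, the following sketch hard-codes the table from Section 1 (with `+` as `1` and `-` as `0`) and uses `zip()` to pair attribute names with row values. The names `rows`, `names` and `check` are illustrative:

```python
rows = [
    (0, 0, 0, 1), (0, 0, 1, 0), (0, 1, 1, 0), (0, 1, 1, 0), (0, 0, 1, 1),
    (1, 0, 1, 1), (1, 0, 1, 0), (1, 0, 1, 0), (1, 1, 1, 1), (1, 0, 1, 1),
]
names = ['A', 'B', 'C']

# same three-level structure as `count`, built with comprehensions
check = {cl: {a: {0: 0, 1: 0} for a in names} for cl in (1, 0)}
for row in rows:
    cl = row[3]
    # zip pairs 'A' with row[0], 'B' with row[1], 'C' with row[2]
    for a, v in zip(names, row):
        check[cl][a][v] += 1

print(check[1])   # {'A': {0: 2, 1: 3}, 'B': {0: 4, 1: 1}, 'C': {0: 1, 1: 4}}
print(check[0])   # {'A': {0: 3, 1: 2}, 'B': {0: 3, 1: 2}, 'C': {0: 0, 1: 5}}
```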

You may check it by looking at the table of the self study assignment. We did a simple count, nothing special. The dictionary helped us to easily address the respective field. No magic, or BAM!, as StatQuest would say.

## 3.3 Determining the Probabilities

In the next step we need to calculate probabilities:

Let's define a dictionary for the class probabilities:

In [None]:
classProbability={1:0,0:0}

Just to avoid hard coding things, let's make a list with the class names:

In [None]:
classNames=[0,1]

We can now use a `for` loop to initialize the class probabilities (with two classes this seems a bit too complex, but we might have more classes). The probability of class $c_i$ is determined by

$$ prob_{c_i}=\frac{nb_{c_i}}{nb_{all}} $$

where $nb_{c_i}$ is the number of occurrences of class $c_i$ and $nb_{all}$ the number of all occurrences:

In [None]:
for cl in classNames:
    classProbability[cl]=classCount[cl]/totalNumber

And let's check it:

In [None]:
print(classProbability)

To determine the conditional probabilities, we must divide the number of occurrences in the `count` variable by the number of occurrences of the respective class. To go through the whole dictionary, we define an additional list for the attribute values:

In [None]:
attributeValues=[0,1]

We also define a dictionary for the probabilities (with code analogous to that for `count`):

In [None]:
probability={1:{'A':{0:0,1:0},'B':{0:0,1:0},'C':{0:0,1:0}},
             0:{'A':{0:0,1:0},'B':{0:0,1:0},'C':{0:0,1:0}}}

We can now use `for` loops to calculate the probabilities. Please **complete** the following code (which currently sets all probabilities to the number of counted samples):

In [None]:
for cl in classNames:
    for a in attributeNames:
        for v in attributeValues:
            # replace the following assignment
            probability[cl][a][v]=count[cl][a][v]

Please `print()` the dictionary and check some values.

In [None]:
print(probability)
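
Written out, the division described above reads as follows, where $nb_{c,a,v}$ is a symbol introduced here for the entry `count[c][a][v]` and $nb_{c}$ is the number of samples of class $c$:

$$ p(a=v|c)=\frac{nb_{c,a,v}}{nb_{c}} $$

Two values from the table in Section 1 that can be used for checking: two of the five samples of class `1` have $A=0$, so $p(A=0|1)=\frac{2}{5}=0.4$, and none of the five samples of class `0` has $C=0$, so $p(C=0|0)=\frac{0}{5}=0$.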

# 4. Classification

In order to classify an instance using the Naive Bayes classifier we calculate the following value for each class $c$ based on the values $v_i$ of all $k$ attributes:

$$ p(c) \prod_{i=1}^k p(v_i|c) $$

where $p(c)$ is the probability of the class and $p(v_i|c)$ the conditional probability of the attribute value $v_i$.
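
As a worked instance, assuming the conditional probabilities that follow from the table in Section 1, the tuple $A=0, B=1, C=1$ gives:

$$ p(1)\,p(A=0|1)\,p(B=1|1)\,p(C=1|1)=\frac{5}{10}\cdot\frac{2}{5}\cdot\frac{1}{5}\cdot\frac{4}{5}=0.032 $$

$$ p(0)\,p(A=0|0)\,p(B=1|0)\,p(C=1|0)=\frac{5}{10}\cdot\frac{3}{5}\cdot\frac{2}{5}\cdot\frac{5}{5}=0.12 $$

The larger value belongs to class `0`, so this tuple is assigned to class `0`.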

Let's assume that the sample to classify is given as a dictionary where the attribute names are the keys. Let's further assume that the class probabilities and the conditional probabilities are defined in dictionaries as above. We can thus define the following classification function:

In [None]:
def classify(newValue,classProb,condProb):
    # newValue is a dictionary describing the sample
    # classProb is the dictionary with the class probabilities
    # condProb is the three level dictionary with the conditional probabilities (class, attributes, values)
    result={'Class':0,'prob':0}
    for cl in classProb.keys():
        p=classProb[cl]
        for a in newValue.keys():
            p=p*condProb[cl][a][newValue[a]]
        print('Class:',cl,'has the probability of',p)
        if p>result['prob']:
            result['Class']=cl
            result['prob']=p
    return result
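
The values printed by `classify()` are the products from the formula above. They are not normalized, so they do not sum to 1 across the classes. If posterior probabilities are wanted, one option is to divide each product by the sum of all products. A sketch, where `scores` holds the two products for the tuple A=1, B=0, C=1 computed by hand from the table (the names `scores`, `total` and `posterior` are illustrative):

```python
# unnormalized products p(c) * prod p(v_i|c) for A=1, B=0, C=1
scores = {1: 0.5 * 0.6 * 0.8 * 0.8,   # class 1
          0: 0.5 * 0.4 * 0.6 * 1.0}   # class 0

total = sum(scores.values())
posterior = {}
for cl, s in scores.items():
    posterior[cl] = s / total

print(posterior)
```

The ranking of the classes is unchanged by this step, which is why the classifier itself can work with the unnormalized products.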
    

Now let's test our classification with some tuples of the table:

1. A tuple that exists only once in the table

In [None]:
values={'A':0,'B':0,'C':0}
print("The classification should be 1")
result= classify(values,classProbability,probability)
print("The resulting class is",result['Class'],'with probability:',result['prob'])

2. A tuple that has 2 occurrences with the same class:

In [None]:
values={'A':0,'B':1,'C':1}
print("The classification should be 0")
result= classify(values,classProbability,probability)
print("The resulting class is",result['Class'],'with probability:',result['prob'])

3. A tuple that has four occurrences with different classes.

In [None]:
values={'A':1,'B':0,'C':1}
print("The tuple values have twice class 1 and twice class 0")
result= classify(values,classProbability,probability)
print("The resulting class is",result['Class'],'with probability:',result['prob'])

In the table this tuple is split evenly between the two classes, but the classifier still has to decide for one of them. Please check the probability values to validate the above result.

4. A new tuple

In [None]:
values={'A':0,'B':1,'C':0}
print("This tuple is new")
result= classify(values,classProbability,probability)
print("The resulting class is",result['Class'],'with probability:',result['prob'])

And here we now have a problem. In the original data there is no sample that has class `0` and a value of `0` for `C`. Therefore the conditional probability $p(C=0|0)$ is 0, and the whole product for class `0` becomes 0.

# 5. Improved Classifier

To overcome such problems, we can add 1 ($\alpha$) to all counts, as also proposed in the video. We can do this when defining the counting dictionary, here called `countImproved`:

In [None]:
countImproved={1:{'A':{0:1,1:1},'B':{0:1,1:1},'C':{0:1,1:1}},0:{'A':{0:1,1:1},'B':{0:1,1:1},'C':{0:1,1:1}}}

Please note that the total number of values as well as the class counts must be corrected accordingly. In each class we added 6 counts (3 attributes with 2 values each) and must thus define the values as:

In [None]:
classCountImproved={1:6,0:6}
totalNumberImproved=12
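
For comparison, a commonly used textbook form of add-$\alpha$ (Laplace) smoothing normalizes per attribute: with $n_v$ possible values of the attribute, $p(v|c)=\frac{nb_{c,v}+\alpha}{nb_c+\alpha\, n_v}$. This keeps the smoothed probabilities of one attribute summing to 1, and it can give different numbers than dividing by a class count that includes all six added counts. A sketch under these assumptions; the function `smoothed` and its parameters are illustrative and not part of the notebook:

```python
def smoothed(valueCounts, alpha=1):
    # valueCounts: raw counts per attribute value within one class
    classTotal = sum(valueCounts.values())
    denominator = classTotal + alpha * len(valueCounts)
    result = {}
    for v, n in valueCounts.items():
        result[v] = (n + alpha) / denominator
    return result

# raw counts of attribute C within class 0, taken from the table
probC0 = smoothed({0: 0, 1: 5})
print(probC0)   # values 1/7 and 6/7
```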

We can equally use the same lists:

In [None]:
attributeNames=['A','B','C']
classNames=[0,1]
attributeValues=[0,1]

**Task 1**: Please repeat the counting of the attribute values per class using the `countImproved` data structure.

In [None]:
# your code
for row in data:
    # the class (1 (+) or 0 (-) ) is stored in the fourth column at index 3
    
    # now we go through the three columns with the attribute values, i is the index of the columns
    for i in range(len(attributeNames)): 
        # first let's retrieve the attribute name (index i of the names list)
        
        # and the value of the attribute (in index i)
        
        # and now we use theClass - theAttribute - theValue as the keys for the three level dictionary
        # and increment the counter
        
    # after we finished looking at all attribute values, we finished one occurrence and increment the
    # corresponding counter for the class
    classCountImproved[theClass]+=1
    # and for the total numbers
    totalNumberImproved+=1

**Task 2:** Please calculate the new probabilities.

The dictionary definition for the probabilities remains unchanged:

In [None]:
probabilityImproved={1:{'A':{0:0,1:0},'B':{0:0,1:0},'C':{0:0,1:0}},0:{'A':{0:0,1:0},'B':{0:0,1:0},'C':{0:0,1:0}}}
classProbabilityImproved={1:0,0:0}

In [None]:
# for the class probabilities
# your code
for cl in classNames:
    # please replace the 1 in the following assignment by the correct code
    classProbabilityImproved[cl]=1

In [None]:
# for the conditional probabilities
# your code
for cl in classNames:
    for a in attributeNames:
        for v in attributeValues:
            # replace the following assignment
            probabilityImproved[cl][a][v]=countImproved[cl][a][v]
            

In [None]:
probabilityImproved


Please repeat the above tests using the newly calculated probabilities:


**Test 1:** A tuple that exists only once in the table

In [None]:
values={'A':0,'B':0,'C':0}
print("The classification should be 1")
# classify the tuple

**Test 2:** A tuple that has 2 occurrences with the same class:

In [None]:
values={'A':0,'B':1,'C':1}
print("The classification should be 0")
# classify the tuple

**Test 3:** A tuple that has four occurrences with different classes.

In [None]:
values={'A':1,'B':0,'C':1}
print("The tuple values have twice class 1 and twice class 0")
# classify the tuple

**Test 4:** A new tuple

In [None]:
values={'A':0,'B':1,'C':0}
print("This tuple is new")
# classify the tuple


*End of the Notebook*

<a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png" /></a><br />This notebook was created by Christina B. Class for teaching at EAH Jena and is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>.