# Lab 2 : Naive Bayes Classifier for Document Classification

@copyright: 
    (c) 2023. iKnow Lab. Ajou Univ., All rights reserved.

M.S. Student: Wansik-Jo (jws5327@ajou.ac.kr)

# For assignment

- Python code의 주석 처리되어있는 부분을 구현하면 됩니다.
- MD 형식의 Cell의 [BLANK] 부분을 채우면 됩니다.
- MD 형식의 Cell의 [ANSWER] 부분 이후에 답을 작성하면 됩니다.
- 조교에게 퀴즈의 답과 함께 코드 실행 결과를 보여준 뒤, BB에 제출 후 가시면 됩니다.


## 목차
1. Document Classification task 이해하기
    - Dataset 살펴보기
    - Task 이해하기
2. Data Preprocessing 
    - Dataloader 구현하기
    - Data 확인하기
    - Data preprocessing 구현하기
3. Naive Bayes Classifier 구현하기
    - 수식 이해하기
    - Vocabulary 구축하기
    - Prior 계산하기
    - Likelihood 계산하기
    - Posterior 계산하기
    - Prediction 구하기
    - Model 학습하기
    - Model 성능 평가하기
4. Naive Bayes Classifier의 여러 변형 모델 응용하기
    - 과제
---

## 1. Document Classification task 이해하기
- Dataset: NewsGroup Documents datasets
- Task: Document Classification
- Input: Document
- Output: Document Category
- Model: Naive Bayes Classifier
- Evaluation: Confusion Matrix, Accuracy

### Document Classification dataset

- URL : https://www.kaggle.com/datasets/jensenbaxter/10dataset-text-document-classification

- Kaggle 10 group dataset
- 각 Document는, 10개의 category 중 하나에 속함
- Categories : business, entertainment, food, graphics, historical, medical, politics, space, sport, technologie
- URL 또는 강의노트에서 제공되는 data를 받아서 사용

In [44]:
#dataset load
categories = {'business', 'entertainment', 'food', 'graphics', 'historical', 'medical', 'politics', 'space', 'sport', 'technologie'}

import os

dataset = []
for category in categories:
    for filename in os.listdir('data/' + category):
        with open('data/' + category + '/' + filename, 'r') as f:
            instance = {}
            instance['text'] = f.read()
            instance['category'] = category
            dataset.append(instance)

import random
random.seed(42)
random.shuffle(dataset)
train_data = dataset[:int(len(dataset) * 0.8)]
test_data = dataset[int(len(dataset) * 0.8):]

print(len(train_data), len(test_data))

800 200


In [45]:
#data 확인
print(train_data[0]['text'])
print(train_data[0]['category'])

#label 비율 확인
from collections import Counter
print(Counter([instance['category'] for instance in train_data]))

Brewers' profits lose their fizz

Heineken and Carlsberg, two of the world's largest brewers, have reported falling profits after beer sales in western Europe fell flat.

Dutch firm Heineken saw its annual profits drop 33% and warned that earnings in 2005 may also slide. Danish brewer Carlsberg suffered a 3% fall in profits due to waning demand and increased marketing costs. Both are looking to Russia and China to provide future growth as western European markets are largely mature.

Heineken's net income fell to 537m euros ($701m; £371m) during 2004, from 798m euro a year ago. It blamed weak demand in western Europe and currency losses. It had warned in September that the weakening US dollar, which has cut the value of foreign sales, would knock 125m euros off its operating profits. Despite the dip in profits, Heineken's sales have been improving and total revenue for the year was 10bn euros, up 8.1% from 9.26bn euros in 2003. Heineken said it now plans to invest 100m euros in "aggres

## 2. Data Preprocessing 구현하기

## Document raw data preprocessing

- 학습한 여러 기법을 활용하여 데이터 전처리를 진행
- Tokenizing, Stemming, Lemmatizing, Stopword 제거, Punctuation 제거, Case 변환 등 다양한 NLP pre processing 방법을 적용
- News data에 맞는 e-mail, url, phone number, number, date, time 등의 특수문자 제거 (Regex)

- !nltk library 만 사용 가능 (import nltk)
- !단 e-mail, url, phone number, number, date, time 등에 관한 처리는 regex를 사용하여 직접 구현

In [54]:
import nltk
import re

#You can use pre-defined tools in nltk, like Tokenizer, Stopword list, etc.

for instance in train_data:
    #You should implement any pre-processing if you need.
    text = instance['text'].lower()

    #Or like here, you can use regular expression to remove any unwanted characters.
    text = re.sub(r'[^a-z\s]', '', text)

    """
    From here, you should implement any pre-process method like tokenizer, stopword list, stemming, lemmatization, etc.
    Or the regular expression above can be considered.

    text = 
    
    """
    instance['tokens'] = text

for instance in test_data:
    #And also for test data.

In [49]:
#check
print(train_data[0]['text'])
print(train_data[0]['tokens'])

Brewers' profits lose their fizz

Heineken and Carlsberg, two of the world's largest brewers, have reported falling profits after beer sales in western Europe fell flat.

Dutch firm Heineken saw its annual profits drop 33% and warned that earnings in 2005 may also slide. Danish brewer Carlsberg suffered a 3% fall in profits due to waning demand and increased marketing costs. Both are looking to Russia and China to provide future growth as western European markets are largely mature.

Heineken's net income fell to 537m euros ($701m; £371m) during 2004, from 798m euro a year ago. It blamed weak demand in western Europe and currency losses. It had warned in September that the weakening US dollar, which has cut the value of foreign sales, would knock 125m euros off its operating profits. Despite the dip in profits, Heineken's sales have been improving and total revenue for the year was 10bn euros, up 8.1% from 9.26bn euros in 2003. Heineken said it now plans to invest 100m euros in "aggres

## 3. Naive Bayes Classifier 구현하기

### Naive Bayes to Documents and Categories

document $d$ 와 category $c$ 에 대하여, posterior probability는 다음과 같이 구할 수 있다.

$$ P(c|d) = \frac{P(d|c)P(c)}{P(d)} $$

이때 $P(c)$ 를 category $c$ 의[BLANK],

$P(d|c)$ 를 category $c$ 에서 document $d$ 의 [BLANK],

$P(d)$ 를 document $d$ 의 [BLANK] 라고 한다.

---

### Naive Bayes Classifier

이 때, MAP(most probable category) $c_{MAP}$ for a document $d$ 는 다음과 같이 구할 수 있다.

$$ C_{MAP} = \underset{c \in C}{\operatorname{argmax}} P(c|d) $$

그러나, $P(c|d)$ 를 직접 계산하기는 어렵다. ($P(d)$ 를 계산하기 어렵기 때문에)

따라서, 여기서 Bayes' theorem 을 사용하여 $P(c|d)$ 를 다음과 같이 바꿀 수 있다.

$$ P(c|d) = \frac{P(d|c)P(c)}{P(d)} $$

</br>

이때, $P(d)$ 는 모든 category $c$ 에 대하여 동일하므로, $P(d)$ 는 상수이다. (계산할 필요가 없다.)

따라서,

$$ C_{MAP} = \underset{c \in C}{\operatorname{argmax}} P(d|c)P(c) $$

이때, $P(d|c)$ 를 [BLANK], $P(c)$ 를 [BLANK] 라고 한다.

---

### Multinomial Naive Bayes Classifier

likelihood $P(d|c)$ 는 다음과 같이 계산할 수 있다.

$$ P(d|c) = P(w_1, w_2, \cdots, w_n|c) = \prod_{i=1}^{n} P(w_i|c) $$

이때 $w_i$ 는 $i$-th 번째 word in document $d$ 이다.

따라서, 최종 식은 다음과 같다.

$$ C_{NB} = \underset{c \in C}{\operatorname{argmax}} P(c) \prod_{i=1}^{n} P(w_i|c) $$

! 여기서 발생할 수 있는 문제점은?

[ANSWER] :

---

### Log Likelihood

문제는 likelihood $P(d|c)$ 가 매우 작은 값이 될 수 있다는 것이다.

따라서, likelihood 대신 log likelihood 를 사용하면 다음과 같다.

$$ C_{NB} = \underset{c \in C}{\operatorname{argmax}} P(c) \sum_{i=1}^{n} \log P(w_i|c) $$

! 여기서 발생할 수 있는 문제점은?

[ANSWER] :

---




### Laplace smoothing

문제는 likelihood $P(d|c)$ 가 0이 될 수 있다는 것이다.

따라서 zero0-probability 를 방지하기 위하여 Laplace smoothing 을 사용한다.

$$ P(w_i|c) = \frac{count(w_i, c) + \alpha}{count(c) + \alpha \times |V|} $$

이때, $count(w_i, c)$ 는 category $c$ 에서 $w_i$ 의 개수이고

$count(c)$ 는 category $c$ 에서 word 의 총 개수,

$|V|$ 는 vocabulary 의 길이.

$\alpha$ 는 smoothing parameter 이다.

---

In [50]:
#make vocabulary
vocabulary = set()
for instance in train_data:
    vocabulary.update(instance['tokens'])

print(categories)
print(len(vocabulary))

{'sport', 'entertainment', 'space', 'politics', 'historical', 'graphics', 'technologie', 'business', 'food', 'medical'}
18717


In [64]:
from collections import defaultdict
import math

def train_classifier(data, categories, vocabulary):
    prior = {category: 0 for category in categories}
    likelihood = defaultdict(lambda: defaultdict(int))
    category_counts = {category: 0 for category in categories} # Initialize category_counts to track total tokens per category

    #You should implement the training process of Naive Bayes classifier.

    for instance in data:
        for token in instance['tokens']:
            """
            Here, you should calculate the prior and likelihood.
            
            """

    total_instances = len(data)
    for category in categories:
        """ 
        Here, calculate the prior and likelihood, for each category.
        Like we have learned above, you should use Log-likelihood, and Laplace smoothing.
        
        prior[category] =
        for word in vocabulary:
            likelihood[category][word] =

        """

    return prior, likelihood, category_counts

prior, likelihood, category_counts = train_classifier(train_data, categories, vocabulary)

In [65]:
def predict(instance, prior, likelihood, categories, category_counts, vocabulary):
    score = {}

    #You should implement the prediction process of Naive Bayes classifier.

    for category in categories:
        """
        Here, you should calculate the score of each category.
        score[category] = prior[category]
        for token in instance['tokens']:
            score[catetory] += 
            You should consider the case when the token is unseen here.

        """

    return max(score, key=score.get)

print(predict(test_data[0], prior, likelihood, categories, category_counts, vocabulary))

medical


In [69]:
def evaluate(data, prior, likelihood, categories, category_counts, vocabulary):
    metrics = {
        'TP': {category: 0 for category in categories},
        'TN': {category: 0 for category in categories},
        'FP': {category: 0 for category in categories},
        'FN': {category: 0 for category in categories},
    }

    #You should implement confusion matrix here. Fill in if statement below.
    for instance in data:
        true_category = instance['category']
        predicted_category = predict(instance, prior, likelihood, categories, category_counts, vocabulary)
        
        for category in categories:
            if :
                metrics['TP'][category] += 1
            if :
                metrics['TN'][category] += 1
            if :
                metrics['FP'][category] += 1
            if :
                metrics['FN'][category] += 1

    precision = {category: metrics['TP'][category] / (metrics['TP'][category] + metrics['FP'][category]) for category in categories}
    recall = {category: metrics['TP'][category] / (metrics['TP'][category] + metrics['FN'][category]) for category in categories}
    accuracy = sum(metrics['TP'].values()) / len(data)
    
    return accuracy, precision, recall

accuracy, precision, recall = evaluate(test_data, prior, likelihood, categories, category_counts, vocabulary)
print(accuracy, precision, recall)

0.97 {'sport': 1.0, 'entertainment': 1.0, 'space': 1.0, 'politics': 0.8421052631578947, 'historical': 0.9285714285714286, 'graphics': 1.0, 'technologie': 1.0, 'business': 1.0, 'food': 0.9411764705882353, 'medical': 0.9615384615384616} {'sport': 1.0, 'entertainment': 1.0, 'space': 0.9166666666666666, 'politics': 0.9411764705882353, 'historical': 1.0, 'graphics': 1.0, 'technologie': 1.0, 'business': 0.92, 'food': 0.9411764705882353, 'medical': 1.0}


## 4. Naive Bayes Classifier의 여러 변형 모델 응용하기

- N-gram을 활용한 Naive Bayes Classifier (과제)