<a href="https://colab.research.google.com/github/UGisBusy/NB-offensive-language-classifier/blob/master/NB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Building an Offensive Language Classifier with Naive Bayes

Dataset: https://huggingface.co/datasets/hate_speech_offensive

---

### Dataset preparation

Load the original dataset and redefine its classes. <br>
- Merge the original **hate speech** and **offensive language** classes into **offensive (1)** <br>
- Rename the original **neither** class to **neutral (0)**

In [None]:
from datasets import load_dataset

# Load the original dataset
raw_datasets = load_dataset('hate_speech_offensive', split='train')

# Build the relabeled dataset.
# Original class 0 (hate speech) and class 1 (offensive language) become label 1;
# original class 2 (neither) becomes label 0.
# Each class is also written to its own file: text_neutral.txt, text_offensive.txt
dataset = []
with open('text_neutral.txt', 'w', encoding='utf-8') as f0, \
     open('text_offensive.txt', 'w', encoding='utf-8') as f1:
    for ds in raw_datasets:
        # Flatten multi-line tweets so that each one occupies a single line
        text = ds['tweet'].replace('\n', ' ')
        if ds['class'] == 2:
            dataset.append({'text': text, 'label': 0})
            f0.write(text + '\n')
        else:
            dataset.append({'text': text, 'label': 1})
            f1.write(text + '\n')

print(f'dataset size: {len(dataset)}')

Found cached dataset hate_speech_offensive (C:/Users/user/.cache/huggingface/datasets/hate_speech_offensive/default/1.0.0/5f5dfc7b42b5c650fe30a8c49df90b7dbb9c7a4b3fe43ae2e66fabfea35113f5)


dataset size: 24783


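### Splitting into training and test sets

Manually split the data into a training set (90%) and a test set (10%).<br>
Because the split is random, the final evaluation repeats the experiment several times and averages the results.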
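
Since every call reshuffles the data, the metrics differ from run to run. If repeatable splits are wanted, one option is to shuffle with a seeded generator. The sketch below is a hypothetical variant: `split_dataset_seeded` and its `seed` parameter are not part of this notebook, and the toy records are made up. Passing a different seed to each repetition would keep the averaging scheme while making each run reproducible.

```python
import random

def split_dataset_seeded(dataset, split_ratio=0.9, seed=0):
    """Shuffle a copy of `dataset` with a private RNG, then cut it in two."""
    rng = random.Random(seed)   # private generator; the global random state is untouched
    shuffled = list(dataset)    # shallow copy so the caller's list keeps its order
    rng.shuffle(shuffled)
    train_size = int(len(shuffled) * split_ratio)
    return shuffled[:train_size], shuffled[train_size:]

# Toy records, for illustration only
toy = [{'text': f'tweet {i}', 'label': i % 2} for i in range(10)]
train_a, test_a = split_dataset_seeded(toy, seed=42)
train_b, test_b = split_dataset_seeded(toy, seed=42)

print(len(train_a), len(test_a))  # 9 1
print(train_a == train_b)         # True: same seed, same split
```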

In [None]:
from copy import deepcopy
from random import shuffle

# Split into a training set (90%) and a test set (10%)
def split_dataset(dataset, split_ratio=0.9):
    train_size = int(len(dataset) * split_ratio)
    tmp_dataset = deepcopy(dataset)
    shuffle(tmp_dataset)
    return tmp_dataset[:train_size], tmp_dataset[train_size:]

train_dataset, test_dataset = split_dataset(dataset)
print(f'train data size: {len(train_dataset)}')
print(f'test data size: {len(test_dataset)}')

train data size: 22304
test data size: 2479


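### Building the bags of words and filtering tokens

While building the bags of words, the vocabulary size |V| is computed as well, for use in later calculations. <br>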
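
To make |V| concrete: it is the number of distinct words seen across both classes, so a word that appears in both bags is counted once. The following is a minimal sketch on made-up, already-normalized tokens; the names `tokens_by_label`, `toy_bags`, and `toy_V` are illustrative and do not appear in the notebook.

```python
from collections import Counter

# Toy, already-normalized tokens per class (illustrative data only)
tokens_by_label = {
    0: ['nice', 'day', 'nice'],
    1: ['bad', 'day'],
}

# One Counter per class plays the role of a bag of words
toy_bags = {label: Counter(tokens) for label, tokens in tokens_by_label.items()}

# |V| is the number of distinct words across both classes ('day' is counted once)
toy_V = len(set(toy_bags[0]) | set(toy_bags[1]))

print(toy_bags[0]['nice'])  # 2
print(toy_V)                # 3
```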

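Tokens are filtered as follows:
- Convert English letters to lowercase
- Drop tokens that are purely numeric
- Replace @someone with @user
- Replace #tag with tag
- Replace URLs with http
- Replace &#dddd with the character it encodes
- Strip special characters from the start and end of each token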

In [None]:
import re

# Normalize a raw token; returns '' when the token should be dropped
def filter_word(raw_word):
    # Convert English letters to lowercase
    word = raw_word.lower()

    # Drop tokens that consist only of digits
    # (fullmatch, so tokens such as "2nd" are kept)
    if re.fullmatch(r'\d+', word):
        return ''

    # Replace @XXXX with @user
    if word.startswith('@'):
        return '@user'

    # Replace #XXXX with XXXX
    if word.startswith('#'):
        return word[1:]

    # Replace http://XXXX and https://XXXX with http
    if re.match(r'https?://', word):
        return 'http'

    # Replace &#XXXX; with chr(XXXX); the trailing semicolon is optional
    word = re.sub(r'&#(\d+);?', lambda m: chr(int(m.group(1))), word)

    # Strip leading and trailing non-word characters
    word = re.sub(r'^\W+|\W+$', '', word)

    return word

# Build the bags of words
def make_bags(dataset):
    bags = {0: {}, 1: {}}
    V = 0  # vocabulary size: number of distinct words across both classes
    for data in dataset:
        label = data['label']
        for raw_word in data['text'].split():
            if not (word := filter_word(raw_word)):
                continue
            if word in bags[label]:
                bags[label][word] += 1
            else:
                # Count the word toward |V| only if neither class has seen it yet
                if word not in bags[1 - label]:
                    V += 1
                bags[label][word] = 1

    # Sort each bag by descending count (ties broken alphabetically)
    # and write it to bag_neutral.txt / bag_offensive.txt
    for label, path in ((0, 'bag_neutral.txt'), (1, 'bag_offensive.txt')):
        lst = sorted(bags[label].items(), key=lambda x: (-x[1], x[0]))
        with open(path, 'w', encoding='utf-8') as f:
            f.writelines(f'{word} {count}\n' for word, count in lst)
    return bags, V


### Predicting with the bags of words
The detailed calculation is described in the report.
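
As a reading aid, the score that the `predict` function in this section computes for each class can be written out as follows. This is reconstructed from the code, not taken from the report. Here $s$ is the sequence of tokens in the sentence that survive filtering, $B_c$ is the bag of words for class $c$, $|B_c|$ is its number of distinct words, and $\operatorname{count}(w, c)$ is the number of occurrences of $w$ in $B_c$ (zero if absent).

```latex
\[
\operatorname{score}(c) = \pi_c \prod_{w \in s} \frac{\operatorname{count}(w, c) + 1}{N_c + |V|},
\qquad
\pi_c = \frac{|B_c|}{|B_0| + |B_1|},
\qquad
N_c = \sum_{w \in B_c} \operatorname{count}(w, c)
\]
```

The predicted label is 0 if $\operatorname{score}(0) > \operatorname{score}(1)$ and 1 otherwise. The "+1" in the numerator and "+|V|" in the denominator are add-one (Laplace) smoothing. Note that the prior $\pi_c$ in the code is the share of distinct words in each bag, whereas textbook multinomial Naive Bayes usually takes the share of training documents in each class.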

In [None]:

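# Predict a label from the bags of words
def predict(sentence, bags, V):
    p = [1, 1]
    for i in range(2):
        # Class prior, estimated here from the number of distinct words in each bag
        p_category = len(bags[i]) / (len(bags[0]) + len(bags[1]))
        # Total number of word occurrences in this class
        count_category = sum(bags[i].values())
        for raw_word in sentence.split():
            if not (word := filter_word(raw_word)):
                continue
            count = bags[i].get(word, 0)
            # Add-one (Laplace) smoothed likelihood of the word given the class
            p[i] *= (count + 1) / (count_category + V)
        p[i] *= p_category
    return 0 if p[0] > p[1] else 1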
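
Multiplying many small probabilities is usually fine for tweet-length text, but for longer inputs the product can underflow to 0.0 for both classes, at which point the comparison no longer carries information. A common remedy is to sum log-probabilities instead. The sketch below assumes the tokens have already been filtered; `predict_log` and the toy bags are hypothetical and not part of this notebook, and the prior is kept the same as in the product form so the two give the same ranking.

```python
import math

def predict_log(tokens, bags, V):
    """Return 0 or 1 using summed log-probabilities instead of a raw product."""
    scores = []
    for label in (0, 1):
        # Same prior as the product form: share of distinct words in this bag
        prior = len(bags[label]) / (len(bags[0]) + len(bags[1]))
        total = sum(bags[label].values())
        score = math.log(prior)
        for token in tokens:
            count = bags[label].get(token, 0)
            # Add-one smoothing keeps the argument of log strictly positive
            score += math.log((count + 1) / (total + V))
        scores.append(score)
    return 0 if scores[0] > scores[1] else 1

# Toy bags, made up for illustration only
toy_bags = {0: {'good': 3, 'day': 2}, 1: {'bad': 4, 'day': 1}}
toy_V = 3  # distinct words across both bags: good, day, bad

print(predict_log(['good', 'day'], toy_bags, toy_V))  # 0
print(predict_log(['bad'], toy_bags, toy_V))          # 1
```

Because the logarithm is monotonic, comparing the sums selects the same class as comparing the products whenever the products do not underflow.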


### Running the experiment
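
The experiment indexes its confusion matrix as `conf_matrix[true label][predicted label]` and treats label 1 (offensive) as the positive class. As a small worked example of how accuracy, precision, and recall follow from that layout, here is a sketch with made-up counts; `metrics_from_confusion` is a hypothetical helper, not a function from this notebook.

```python
def metrics_from_confusion(conf_matrix):
    """conf_matrix[true_label][predicted_label], with label 1 as the positive class."""
    tn, fp = conf_matrix[0]   # true label 0: predicted 0, predicted 1
    fn, tp = conf_matrix[1]   # true label 1: predicted 0, predicted 1
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # of everything predicted offensive, how much was offensive
    recall = tp / (tp + fn)      # of everything truly offensive, how much was found
    return accuracy, precision, recall

# Made-up counts for illustration, not results from this notebook
toy_matrix = [[30, 10], [5, 55]]
accuracy, precision, recall = metrics_from_confusion(toy_matrix)
print(accuracy)  # 0.85
print(precision, recall)
```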


In [None]:
# Run the experiment 10 times
N = 10
record = {'accuracy': [], 'precision': [], 'recall': []}
for i in range(N):
    # Split into a training set (90%) and a test set (10%)
    train_dataset, test_dataset = split_dataset(dataset)

    # Build the bags of words
    bags, V = make_bags(train_dataset)

    # Fill the confusion matrix: conf_matrix[true label][predicted label]
    conf_matrix = [[0, 0], [0, 0]]
    for data in test_dataset:
        conf_matrix[data['label']][predict(data['text'], bags, V)] += 1

    # Compute and record the evaluation metrics (label 1 is the positive class)
    accuracy = (conf_matrix[0][0] + conf_matrix[1][1]) / (conf_matrix[0][0] + conf_matrix[0][1] + conf_matrix[1][0] + conf_matrix[1][1])
    precision = conf_matrix[1][1] / (conf_matrix[1][1] + conf_matrix[0][1])
    recall = conf_matrix[1][1] / (conf_matrix[1][1] + conf_matrix[1][0])
    record['accuracy'].append(accuracy)
    record['precision'].append(precision)
    record['recall'].append(recall)

    # Print accuracy, precision, and recall for this run
    print(f'{i+1}th result:')
    print(f'accuracy:  {accuracy}')
    print(f'precision: {precision}')
    print(f'recall:    {recall}')
    print()

# Print the average accuracy, precision, and recall
print(f'average accuracy:  {sum(record["accuracy"])/N}')
print(f'average precision: {sum(record["precision"])/N}')
print(f'average recall:    {sum(record["recall"])/N}')
    

1th result:
accuracy:  0.9290036304961679
precision: 0.9476879962634283
recall:    0.9694218824653608

2th result:
accuracy:  0.9318273497377975
precision: 0.9546742209631728
recall:    0.9651551312649165

3th result:
accuracy:  0.9261799112545381
precision: 0.9460227272727273
recall:    0.9666182873730044

4th result:
accuracy:  0.9294070189592578
precision: 0.9517607332368548
recall:    0.96337890625

5th result:
accuracy:  0.918918918918919
precision: 0.9502392344497608
recall:    0.9534325492078732

6th result:
accuracy:  0.9201290843081887
precision: 0.9428982725527831
recall:    0.9613502935420744

7th result:
accuracy:  0.9205324727712787
precision: 0.9428571428571428
recall:    0.9625668449197861

8th result:
accuracy:  0.9221460266236385
precision: 0.9445241511238642
recall:    0.9624756335282652

9th result:
accuracy:  0.9318273497377975
precision: 0.9510421715947649
recall:    0.9665024630541872

10th result:
accuracy:  0.9237595804759984
precision: 0.948300622307324
recall: