<a href="https://colab.research.google.com/github/UGisBusy/NB-offensive-language-classifier/blob/master/NB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Building an Offensive Language Classifier with Naive Bayes

Dataset: https://huggingface.co/datasets/hate_speech_offensive

---

### Dataset preparation

Load the original dataset and redefine its classes. <br>
- Merge the original **hate speech** and **offensive language** classes into **offensive(1)** <br>
- Rename the original **neither** class to **neutral(0)**

In [13]:
from datasets import load_dataset

# Load the original dataset
raw_datasets = load_dataset('hate_speech_offensive', split='train')

# Build the new dataset
# Relabel: original class 0 (hate speech) and class 1 (offensive language) become label 1; class 2 (neither) becomes label 0
# Write the text of each label to text_neutral.txt and text_offensive.txt respectively
dataset = []
with open('text_neutral.txt', 'w', encoding='utf-8') as f0, \
     open('text_offensive.txt', 'w', encoding='utf-8') as f1:
    for ds in raw_datasets:
        # Keep each tweet on a single line
        text = ds['tweet'].replace('\n', ' ')
        if(ds['class'] == 2):
            dataset.append({'text': text, 'label': 0})
            f0.write(text + '\n')
        else:
            dataset.append({'text': text, 'label': 1})
            f1.write(text + '\n')

print(f'dataset size: {len(dataset)}')

Found cached dataset hate_speech_offensive (C:/Users/user/.cache/huggingface/datasets/hate_speech_offensive/default/1.0.0/5f5dfc7b42b5c650fe30a8c49df90b7dbb9c7a4b3fe43ae2e66fabfea35113f5)


dataset size: 24783


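### Splitting into training and test sets

Manually split the data into a training set (90%) and a test set (10%).<br>
Because the split is random, the final evaluation is run several times and the results are averaged.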

In [14]:
from copy import deepcopy
from random import shuffle

# Split into a training set (90%) and a test set (10%)
def split_dataset(dataset, split_ratio=0.9):
    train_size = int(len(dataset) * split_ratio)
    tmp_dataset = deepcopy(dataset)
    shuffle(tmp_dataset)
    return tmp_dataset[:train_size], tmp_dataset[train_size:]

train_dataset, test_dataset = split_dataset(dataset)
print(f'train data size: {len(train_dataset)}')
print(f'test data size: {len(test_dataset)}')

train data size: 22304
test data size: 2479
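
Because `split_dataset` shuffles with the global random state, every run produces a different split. If reproducible splits are wanted, one option is a dedicated `random.Random` instance. The sketch below is an illustration only: the helper name `split_dataset_seeded`, its `seed` parameter, and the toy data are assumptions, not part of the notebook.

```python
from random import Random

def split_dataset_seeded(dataset, split_ratio=0.9, seed=0):
    """Shuffle a copy of `dataset` with a fixed seed and split it (hypothetical helper)."""
    train_size = int(len(dataset) * split_ratio)
    shuffled = list(dataset)          # shallow copy; the original order is untouched
    Random(seed).shuffle(shuffled)    # private generator, global random state is unaffected
    return shuffled[:train_size], shuffled[train_size:]

# Toy data invented for the example
toy = [{'text': f'sample {i}', 'label': i % 2} for i in range(10)]
train_a, test_a = split_dataset_seeded(toy, seed=42)
train_b, test_b = split_dataset_seeded(toy, seed=42)
print(len(train_a), len(test_a))   # 9 1
print(train_a == train_b)          # True: same seed, same split
```

Passing a different `seed` on each of the repeated runs would keep the averaging idea while making every run repeatable.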


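### Filtering tokens
Tokens are filtered in the following ways:
- Convert English letters to lower case
- Drop tokens that consist only of digits
- Replace @someone with @user
- Replace #tag with tag
- Replace URLs with http
- Replace &#dddd with the character it encodes
- Strip special characters from the start and end of each token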

In [15]:
import re

# Filter a token; a falsy return value (False or '') means the token is skipped
def filter_word(raw_word):
    # Convert to lower case
    word = raw_word.lower()

    # Drop tokens that consist only of digits
    if(re.fullmatch(r'\d+', word)):
        return False

    # Replace @XXXX with @user
    if(word.startswith('@')):
        return '@user'

    # Replace #XXXX with XXXX
    if(word.startswith('#')):
        return word[1:]

    # Replace http://XXXX and https://XXXX with http
    if(word.startswith(('http://', 'https://'))):
        return 'http'

    # Replace &#XXXX; with chr(XXXX); the trailing semicolon is optional
    word = re.sub(r'&#(\d+);?', lambda m: chr(int(m.group(1))), word)

    # Strip non-word characters from both ends
    word = re.sub(r'^\W+|\W+$', '', word)

    return word
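
The character-reference step could also be delegated to the standard library. As a small sketch (this is not what the notebook does), `html.unescape` decodes numeric references such as `&#128514;` as well as named ones such as `&amp;`. The sample tokens are made up for illustration.

```python
import html

# Hypothetical tokens resembling what appears in tweets
tokens = ['&#128514;', 'rock&amp;roll', 'plain']

# html.unescape leaves ordinary text untouched and decodes the references
decoded = [html.unescape(t) for t in tokens]
print(decoded)
```

One design consequence: `filter_word` reduces a bare `&amp;` to the token `amp`, whereas decoding it first would turn it into `&`, which the punctuation stripping then removes entirely.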

### Building the bags of words
Build the bags of words and compute |V| along with the number of neutral and offensive samples. <br>

In [16]:
# Build the bags of words while computing |V| and the number of neutral/offensive samples
def make_bags(dataset):
    bags = {0:{}, 1:{}}
    counts = {0:0, 1:0}
    V = 0
    for data in dataset:
        counts[data['label']] += 1
        for raw_word in data['text'].split():
            if(not (word:=filter_word(raw_word))):
                continue
            if(word in bags[data['label']]):
                bags[data['label']][word] += 1
            else:
                if(word not in bags[1-data['label']]):
                    V += 1
                bags[data['label']][word] = 1

    return bags, counts, V

bags, counts, V = make_bags(train_dataset)
print(f'|V| = {V}')
print(f'number of neutral data: {counts[0]}')
print(f'number of offensive data: {counts[1]}')

|V| = 22895
number of neutral data: 3744
number of offensive data: 18560
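
The same bookkeeping can be expressed with `collections.Counter`. The sketch below is a hedged variant, not the notebook's code: it uses a toy dataset and plain lower-casing in place of `filter_word` (both assumptions made for brevity), and it takes |V| to be the number of distinct words across both classes, which is what the counter `V` in `make_bags` tracks.

```python
from collections import Counter

def make_bags_counter(dataset):
    """Build one word-count bag per label (hypothetical Counter-based variant)."""
    bags = {0: Counter(), 1: Counter()}
    counts = Counter()
    for data in dataset:
        counts[data['label']] += 1
        # Simplification: lower-case and split on whitespace only, no filter_word
        bags[data['label']].update(data['text'].lower().split())
    vocab_size = len(set(bags[0]) | set(bags[1]))   # distinct words over both classes
    return bags, counts, vocab_size

# Toy samples invented for the example
toy = [
    {'text': 'have a nice day', 'label': 0},
    {'text': 'you are a fool', 'label': 1},
    {'text': 'what a fool', 'label': 1},
]
toy_bags, toy_counts, toy_V = make_bags_counter(toy)
print(toy_V)                         # 8
print(toy_bags[1]['fool'])           # 2
print(toy_counts[0], toy_counts[1])  # 1 2
```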


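### Optimizing the bags

Remove common words to improve the performance of the bags. <br>
Words that appear both in the top 100 of the neutral bag and in the top 200 of the offensive bag are deleted.<br>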

In [17]:
# Remove common words to improve performance
def optimize_bag(bags):
    # Find the words that occur most often in both the neutral and the offensive bag
    common_words = []
    sorted_neutral_bags = sorted(bags[0].items(), key=lambda x: (-x[1], x[0]))
    sorted_offensive_bags = sorted(bags[1].items(), key=lambda x: (-x[1], x[0]))
    neutral_words = set([data[0] for data in sorted_neutral_bags[:100]])
    offensive_words = set([data[0] for data in sorted_offensive_bags[:200]])
    for word in neutral_words:
        if(word in offensive_words):
            common_words.append(word)
            offensive_words.remove(word)

    # Delete those frequent words from both bags (modifies bags in place)
    for word in common_words:
        bags[0].pop(word)
        bags[1].pop(word)
    
    return bags, common_words
    
# Save the bags to disk
def save_bags(bags):
    # Sort each bag by descending count and write it to bag_neutral.txt / bag_offensive.txt
    for label, filename in ((0, 'bag_neutral.txt'), (1, 'bag_offensive.txt')):
        lst = sorted(bags[label].items(), key=lambda x: (-x[1], x[0]))
        with open(filename, 'w', encoding='utf-8') as f:
            f.writelines([f'{data[0]} {data[1]}\n' for data in lst])

# Print the common words as an example
print(f'common words: {optimize_bag(bags)[1]}')

common words: ['why', 'by', 'know', 'a', 'all', '@user', 'get', 'see', 'them', "don't", 'he', 'an', 'out', 'lol', 'back', 'my', 'are', 'amp', "i'm", "it's", 'got', 'like', 'http', 'and', 'too', 'his', 'when', 'time', 'go', 'new', 'one', 'up', 'good', 'can', 'rt', 'at', 'as', 'trash', 'more', 'with', 'we', 'there', 'your', 'day', 'want', 'i', 'will', 'u', 'for', 'was', 'but', 'do', 'that', 'off', 'how', 'just', 'me', 'has', 'no', 'the', 'in', 'their', 'it', 'still', 'or', 'love', 'be', 'man', 'only', 'some', 'now', 'make', 'about', 'been', 'they', 'what', 'would', 'this', 'is', 'people', 'have', 'who', 'look', 'you', 'of', 'if', 'so', 'not', 'to', 'from', 'on']


### Predicting with the bags
The detailed computation is described in the report.
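
Reading off the `predict` function in this section, the rule appears to be multinomial Naive Bayes with add-one (Laplace) smoothing. The notation here is an assumption introduced for this summary: $N_c$ is the number of training samples with label $c$, $\mathrm{count}(w, c)$ is how often word $w$ occurs in the bag for label $c$, and $|V|$ is the vocabulary size.

$$
\hat{c} = \arg\max_{c \in \{0, 1\}} \; P(c) \prod_{w \in \text{sentence}} P(w \mid c),
\qquad
P(c) = \frac{N_c}{N_0 + N_1},
\qquad
P(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w'} \mathrm{count}(w', c) + |V|}
$$

In the code, a tie between the two scores is resolved in favor of label 1.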

In [18]:

# Predict a label using the bags of words
def predict(sentence, bags, counts, V):
    p = [1, 1]
    for i in range(2):
        # P(C)
        p_category = counts[i] / (counts[0] + counts[1])
        
        # P(W|C) with add-one smoothing
        count_category = sum(bags[i].values())
        for raw_word in sentence.split():
            if(not (word:=filter_word(raw_word))):
                continue
            count = bags[i].get(word, 0)
            p[i] *= (count + 1) / (count_category + V)
        p[i] *= p_category
    return 0 if(p[0] > p[1]) else 1
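
Multiplying many small probabilities can underflow to 0.0 on long inputs, at which point both scores tie and the prediction falls back to label 1. A common remedy is to sum logarithms instead. The sketch below shows one way to do that under stated assumptions: `predict_log` is a hypothetical name, it lower-cases and splits on whitespace instead of calling `filter_word`, and the toy bags, counts, and vocabulary size are invented for the example.

```python
import math

def predict_log(sentence, bags, counts, V):
    """Naive Bayes decision in log space with add-one smoothing (hypothetical variant)."""
    scores = []
    for label in (0, 1):
        # log P(C)
        score = math.log(counts[label] / (counts[0] + counts[1]))
        total = sum(bags[label].values())
        # Sum of log P(w|C); simplified tokenization, no filter_word
        for word in sentence.lower().split():
            score += math.log((bags[label].get(word, 0) + 1) / (total + V))
        scores.append(score)
    return 0 if scores[0] > scores[1] else 1

# Toy model invented for the example
toy_bags = {0: {'nice': 3, 'day': 2}, 1: {'fool': 4, 'day': 1}}
toy_counts = {0: 2, 1: 2}
toy_V = 3  # distinct words: nice, day, fool

short_pred = predict_log('nice day', toy_bags, toy_counts, toy_V)
# 2000 repetitions would drive a raw product of probabilities to 0.0 for both labels
long_pred = predict_log('nice ' * 2000, toy_bags, toy_counts, toy_V)
print(short_pred, long_pred)   # 0 0
```

The comparison is unchanged because the logarithm is monotonic, so the two versions agree whenever the product form does not underflow.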


### Running the experiment


In [19]:
# Run the experiment 10 times
N = 10
record = {'accuracy': [], 'precision': [], 'recall': []}
for i in range(N):
    # Split into a training set (90%) and a test set (10%)
    train_dataset, test_dataset = split_dataset(dataset)
    
    # Build the bags of words
    bags, counts, V = make_bags(train_dataset)
    # bags, _ = optimize_bag(bags)
    
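    # Evaluate on the test set and record the results in a confusion matrix (rows: actual label, columns: predicted label)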
    conf_matrix = [[0, 0], [0, 0]]
    for data in test_dataset:
        conf_matrix[data['label']][predict(data['text'], bags, counts, V)] += 1
    
    # Record the evaluation metrics
    accuracy = (conf_matrix[0][0] + conf_matrix[1][1]) / (conf_matrix[0][0] + conf_matrix[0][1] + conf_matrix[1][0] + conf_matrix[1][1])
    precision = conf_matrix[1][1] / (conf_matrix[1][1] + conf_matrix[0][1])
    recall = conf_matrix[1][1] / (conf_matrix[1][1] + conf_matrix[1][0])
    record['accuracy'].append(accuracy)
    record['precision'].append(precision)
    record['recall'].append(recall)

    # Print accuracy, precision, and recall
    print(f'{i+1}th result:')
    print(f'accuracy:  {accuracy}')
    print(f'precision: {precision}')
    print(f'recall:    {recall}')
    print()

# Print the average accuracy, precision, and recall
print(f'average accuracy:  {sum(record["accuracy"])/N}')
print(f'average precision: {sum(record["precision"])/N}')
print(f'average recall:    {sum(record["recall"])/N}')

print(predict("ur mom so fat", bags, counts, V))
    

1th result:
accuracy:  0.9229528035498185
precision: 0.9417792268281323
recall:    0.9683908045977011

2th result:
accuracy:  0.9245663574021783
precision: 0.9434050514499532
recall:    0.9683149303888622

3th result:
accuracy:  0.918515530455829
precision: 0.9298162976919454
recall:    0.9738529847064628

4th result:
accuracy:  0.9156918112141993
precision: 0.9289033457249071
recall:    0.972749391727494

5th result:
accuracy:  0.9112545381202097
precision: 0.9250465549348231
recall:    0.9711632453567938

6th result:
accuracy:  0.9152884227511093
precision: 0.9292321924144311
recall:    0.9724104549854792

7th result:
accuracy:  0.9128680919725696
precision: 0.9276377217553688
recall:    0.97021484375

8th result:
accuracy:  0.918112141992739
precision: 0.923438233912635
recall:    0.9800598205383848

9th result:
accuracy:  0.9108511496571198
precision: 0.9288389513108615
recall:    0.9663906478324403

10th result:
accuracy:  0.9140782573618395
precision: 0.9269662921348315
recall:  
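
F1, the harmonic mean of precision and recall, can be derived from the same `conf_matrix[actual][predicted]` layout used in the experiment. The following is a minimal sketch: the function name `metrics_from_confusion`, the zero-division guards, and the sample matrix are assumptions for illustration, and the numbers are not results from the notebook.

```python
def metrics_from_confusion(conf_matrix):
    """Compute accuracy, precision, recall and F1 with label 1 as the positive class."""
    tn, fp = conf_matrix[0]   # actual neutral: predicted 0, predicted 1
    fn, tp = conf_matrix[1]   # actual offensive: predicted 0, predicted 1
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against empty denominators (an addition, not in the notebook)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Invented counts, not results from the notebook
sample = [[30, 10], [5, 55]]
acc, prec, rec, f1 = metrics_from_confusion(sample)
print(f'accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}')
```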

In [20]:
print(predict("im happy", bags, counts, V))

1
