<a href="https://colab.research.google.com/github/dynle/youtube-hate-speech-classification/blob/master/youtube_hate_speech_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **YouTube Hate Speech Classification with BERT**

## Get the dataset file from Kaggle and restructure its columns
https://www.kaggle.com/surekharamireddy/malignant-comment-classification


In [None]:
import pandas as pd
df = pd.read_csv('./dataset.csv')
df.head()

| | isMalignant | comment_text |
|---|---|---|
| 0 | 0 | Explanation\nWhy the edits made under my usern... |
| 1 | 0 | D'aww! He matches this background colour I'm s... |
| 2 | 0 | Hey man, I'm really not trying to edit war. It... |
| 3 | 0 | "\nMore\nI can't make any real suggestions on ... |
| 4 | 0 | You, sir, are my hero. Any chance you remember... |


In [None]:
# Print the whole frame (df.info without parentheses only returns the bound method)
print(df)

        isMalignant                                       comment_text
0                 0  Explanation\nWhy the edits made under my usern...
1                 0  D'aww! He matches this background colour I'm s...
2                 0  Hey man, I'm really not trying to edit war. It...
3                 0  "\nMore\nI can't make any real suggestions on ...
4                 0  You, sir, are my hero. Any chance you remember...
...             ...                                                ...
159566            0  ":::::And for the second time of asking, when ...
159567            0  You should be ashamed of yourself \n\nThat is ...
159568            0  Spitzer \n\nUmm, theres no actual article for ...
159569            0  And it looks like it was actually you who put ...
159570            0  "\nAnd ... I really don't think you understand...

[159571 rows x 2 columns]

In the `isMalignant` column, 1 denotes a malignant comment and 0 denotes a normal comment.
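Before sampling, it can help to check how many comments fall into each class. A minimal sketch, assuming the labels are available as a plain list of integers; the function name `label_counts` and the `labels` list are made up for illustration and are not taken from the dataset:

```python
from collections import Counter

def label_counts(labels):
    """Count how many comments carry each label (0 = normal, 1 = malignant)."""
    counts = Counter(labels)
    return {label: counts.get(label, 0) for label in (0, 1)}

# Hypothetical labels for illustration only
labels = [0, 0, 1, 0, 1, 0]
print(label_counts(labels))  # {0: 4, 1: 2}
```

With the real data, `label_counts(df['isMalignant'].tolist())` would give the counts for the full dataset.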

## Remove newline characters (`\n`) from each comment

In [None]:
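# Replace newlines, carriage returns and tabs with spaces so that each comment
# fits on a single tab-separated line when it is written to a text file later
df['comment_text'] = df['comment_text'].str.replace(r'[\r\n\t]', ' ', regex=True)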

## Group the data by the value (0 or 1) of the 'isMalignant' column

In [None]:
grouped = df.groupby(df.isMalignant)

group_0 = grouped.get_group(0)
group_1 = grouped.get_group(1)
group_0.values

array([[0,
        "Explanation Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27"],
       [0,
        "D'aww! He matches this background colour I'm seemingly stuck with. Thanks.  (talk) 21:51, January 11, 2016 (UTC)"],
       [0,
        "Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info."],
       ...,
       [0,
        'Spitzer   Umm, theres no actual article for prostitution ring.  - Crunch Captain.'],
       [0,
        'And it looks like it was actually you who put on the speedy to have the first version deleted now that I look at it.'],
       [0,
        '" And ... I really don\'t think you unders

## Take 600 samples (300 per class) from the dataset and save them to a text file

In [None]:
with open('dataset.txt','w') as f:
  for line in group_0.values[:300]:
    f.write(str(line[0])+'\t'+line[1]+'\n')
  for line in group_1.values[:300]:
    f.write(str(line[0])+'\t'+line[1]+'\n')

## Shuffle the dataset and split it into training and test sets

In [None]:
!shuf dataset.txt -o shuffled.txt
!head -400 shuffled.txt > train.txt
!tail -200 shuffled.txt > test.txt

open('train.txt').readlines()
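The shell commands above give a different split on every run. If a reproducible split is wanted, the same steps can be sketched in Python with a fixed seed. The function name `shuffle_and_split`, the seed value and the stand-in lines are assumptions for illustration, not part of the original notebook:

```python
import random

def shuffle_and_split(lines, n_train, n_test, seed=0):
    """Shuffle a copy of `lines`, then take the first n_train lines for
    training and the last n_test lines for testing (like head/tail)."""
    if n_train + n_test > len(lines):
        raise ValueError("train and test portions would overlap")
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[-n_test:] if n_test > 0 else []
    return shuffled[:n_train], test

# Hypothetical stand-in for the 600 lines of dataset.txt
lines = [f"{i % 2}\tcomment {i}\n" for i in range(600)]
train, test = shuffle_and_split(lines, n_train=400, n_test=200)
print(len(train), len(test))  # 400 200
```

Using a dedicated `random.Random(seed)` instance keeps the shuffle independent of any other code that touches the global random state.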

## Install the libraries needed to use BERT from Python.

In [None]:
!pip install transformers==4.5.0 fugashi==1.1.0 ipadic==1.0.0 pytorch-lightning==1.2.10

## Load the training data.

In [None]:
import torch
from torch.utils.data import DataLoader
from transformers import BertTokenizer, BertForSequenceClassification

# Japanese pre-trained model
# MODEL_NAME = 'cl-tohoku/bert-base-japanese-whole-word-masking'
MODEL_NAME = 'bert-base-uncased'

# Load the training data (each line is "label<TAB>text")
train_lines = [x.rstrip().split('\t')[1] for x in open("train.txt").readlines()]
train_labels = [int(x.split('\t')[0]) for x in open("train.txt").readlines()]

# Load the test data
test_lines = [x.rstrip().split('\t')[1] for x in open("test.txt").readlines()]
test_labels = [int(x.split('\t')[0]) for x in open("test.txt").readlines()]

# Load the tokenizer
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)

def create_dataset_for_loader(lines, labels):
  dataset_for_loader = []
  for i in range(len(lines)):
    encoding = tokenizer(lines[i],max_length=128,padding='max_length',truncation=True)
    encoding['labels'] = labels[i]
    encoding = { k: torch.tensor(v) for k, v in encoding.items() }
    dataset_for_loader.append(encoding)
  return dataset_for_loader

dataset_for_loader_train = create_dataset_for_loader(train_lines, train_labels)
dataset_for_loader_test = create_dataset_for_loader(test_lines, test_labels)

dataset_train = dataset_for_loader_train[50:] # training data
dataset_val = dataset_for_loader_train[:50] # validation data
dataset_test = dataset_for_loader_test # evaluation data

dataloader_train = DataLoader(
    dataset_train, batch_size=16, shuffle=True
) 
dataloader_val = DataLoader(dataset_val, batch_size=16)
dataloader_test = DataLoader(dataset_test, batch_size=1)


## Define the model with the following code.

In [None]:
from tqdm import tqdm
import torch
from torch.utils.data import DataLoader

from transformers import BertForSequenceClassification
import pytorch_lightning as pl

class BertForSequenceClassification_pl(pl.LightningModule):
    def __init__(self, model_name, num_labels, lr):
        super().__init__()
        self.save_hyperparameters() 
        self.bert_sc = BertForSequenceClassification.from_pretrained(
            model_name, num_labels=num_labels)
        self.test_results = []
        
    def training_step(self, batch, batch_idx):
        output = self.bert_sc(**batch)
        loss = output.loss
        self.log('train_loss', loss) 
        return loss
        
    def validation_step(self, batch, batch_idx):
        output = self.bert_sc(**batch)
        val_loss = output.loss
        self.log('val_loss', val_loss)

    def reset_test_results(self):
        self.test_results = []

    def test_step(self, batch, batch_idx):
        labels = batch.pop('labels')
        output = self.bert_sc(**batch)
        probs = torch.nn.functional.softmax(output.logits,dim=-1)
        labels_predicted = output.logits.argmax(-1)
        num_correct = ( labels_predicted == labels ).sum().item()
        accuracy = num_correct/labels.size(0) 
        hyp = labels_predicted.cpu().numpy()[0]
        ref = labels.cpu().numpy()[0]
        prob = probs.cpu().numpy()[0][hyp]
        self.test_results.append({"hyp":hyp, "ref": ref, "prob":prob})
        self.log('accuracy', accuracy)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)

# Specify when the model weights are saved during training
checkpoint = pl.callbacks.ModelCheckpoint(
    monitor='val_loss',
    mode='min',
    save_top_k=1,
    save_weights_only=True,
    dirpath='model/',
)

# Specify when to stop once training no longer improves
early_stopping = pl.callbacks.EarlyStopping(
    min_delta=0.00,
    patience=1,
    verbose=True,
    monitor='val_loss',
    mode='min',
)    

# Specify how training is run
trainer = pl.Trainer(
    gpus=1, 
    max_epochs=10,
    callbacks = [checkpoint,early_stopping]
)

# Create the model used for training
model = BertForSequenceClassification_pl(MODEL_NAME, num_labels=2, lr=1e-5)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores



Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at
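The `EarlyStopping` callback configured above uses `patience=1` and `min_delta=0.00` on `val_loss`. Roughly, training stops once the validation loss has failed to improve for `patience` consecutive validation checks. The sketch below is a simplified plain-Python illustration of that rule, not PyTorch Lightning's actual implementation, and the loss values and the function name `early_stopping_epoch` are invented:

```python
def early_stopping_epoch(val_losses, patience=1, min_delta=0.0):
    """Return the index of the epoch after which training would stop,
    or None if the stopping rule never triggers."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if best - loss > min_delta:
            # The loss improved by more than min_delta: reset the counter
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None

# Invented validation losses for illustration only
print(early_stopping_epoch([0.60, 0.50, 0.45, 0.47, 0.40]))  # 3
```

With `patience=1`, a single epoch without improvement ends training, so a later recovery (the 0.40 in the invented list) is never reached. A larger patience trades longer training for robustness to noisy validation loss.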

## Run training (takes some time)

In [None]:
trainer.fit(model, dataloader_train, dataloader_val) 
best_model_path = checkpoint.best_model_path # path to the best model file
print('Best model file: ', checkpoint.best_model_path)
print('Validation loss of the best model: ', checkpoint.best_model_score)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name    | Type                          | Params
----------------------------------------------------------
0 | bert_sc | BertForSequenceClassification | 109 M 
----------------------------------------------------------
109 M     Trainable params
0         Non-trainable params
109 M     Total params
437.935   Total estimated model params size (MB)

Best model file:  /content/model/epoch=2-step=65.ckpt
Validation loss of the best model:  tensor(0.4342, device='cuda:0')
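The step number in the checkpoint file name can be cross-checked against the data sizes. Assuming `train.txt` holds 400 lines, 50 of which are held out for validation, a batch size of 16 gives 22 training batches per epoch. A small sketch of that arithmetic (the helper name `batches_per_epoch` is made up for illustration):

```python
import math

def batches_per_epoch(n_examples, batch_size):
    """Number of batches a DataLoader yields per epoch when the last,
    smaller batch is kept (drop_last=False, the default)."""
    return math.ceil(n_examples / batch_size)

n_train = 400 - 50                       # lines in train.txt minus validation examples
steps = batches_per_epoch(n_train, 16)   # 350 / 16 rounded up
print(steps, 3 * steps)                  # 22 66
```

Three full epochs therefore amount to 66 optimizer steps, which would match `epoch=2-step=65` if both counters in the file name are zero-based. That is an assumption about the naming scheme, not something the notebook states.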


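## Run classification and display the results. Each row shows the sequence number, gold label, predicted label, probability, and text, in that order.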

In [None]:
model.reset_test_results()
test = trainer.test(test_dataloaders=dataloader_test)

for i in range(len(test_lines)):
  line = test_lines[i]
  label = test_labels[i]
  d = model.test_results[i]
  hyp = d['hyp'].item()
  prob = d['prob'].item() 
  print(f"{i+1}\t{label}\t{hyp}\t{prob}\t{line}")

print(f'Accuracy: {test[0]["accuracy"]:.3f}')

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]



--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'accuracy': 0.8999999761581421}
--------------------------------------------------------------------------------
1	0	0	0.7955997586250305	"   You beat me to it.   Just wanted to say good work for beating me to the revert on the Ned Kelly article, I always enjoy it when I know someone else is on the hunt for vandals. Happy Hunting. Cheers Pro "
2	0	0	0.5162985920906067	The problem is that people keep trying to state that BSAs policies are this and that.  There is no policy that says homosexual scouts cannot be members.  To state otherwise is a lie and OR.  Just be it is a contraversy page doesn't allow you state lies or add OR, you can list things as misinterpretations of the rules and things like that but it you try to say scouting does this or scouting does that then you need to be sure scouting actually says that.  Even the misinterpretations of the rules need sources.  Just be
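Accuracy alone hides which kind of mistake the classifier makes. Since each entry of `model.test_results` stores the gold label under `ref` and the prediction under `hyp`, precision, recall and F1 for the malignant class could be computed along these lines. The function name `precision_recall_f1` and the sample records are assumptions for illustration:

```python
def precision_recall_f1(results, positive=1):
    """Compute precision, recall and F1 for the `positive` class from
    a list of {"ref": gold_label, "hyp": predicted_label} records."""
    tp = sum(1 for r in results if r["hyp"] == positive and r["ref"] == positive)
    fp = sum(1 for r in results if r["hyp"] == positive and r["ref"] != positive)
    fn = sum(1 for r in results if r["hyp"] != positive and r["ref"] == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented records for illustration only
sample = [
    {"ref": 1, "hyp": 1},
    {"ref": 1, "hyp": 0},
    {"ref": 0, "hyp": 1},
    {"ref": 0, "hyp": 0},
]
print(precision_recall_f1(sample))  # (0.5, 0.5, 0.5)
```

In the notebook this would be called as `precision_recall_f1(model.test_results)` after `trainer.test(...)` has filled the list.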

### Classify YouTube comments retrieved via the YouTube API as hate speech or not

In [None]:
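comment_lines = [x.strip() for x in open("comments.txt").readlines()]
# The comments have no gold labels, so every comment gets the dummy label 1.
# The 'accuracy' reported below is therefore the fraction of comments predicted as 1.
comment_labels = [1 for x in comment_lines]

dataset_for_loader_comment = create_dataset_for_loader(comment_lines, comment_labels)
dataloader_comment = DataLoader(dataset_for_loader_comment, batch_size=1)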

model.reset_test_results()
test = trainer.test(test_dataloaders=dataloader_comment)

for i in range(len(comment_lines)):
  line = comment_lines[i]
  d = model.test_results[i]
  hyp = d['hyp'].item()
  prob = d['prob'].item() 
  # if hyp == 1 and prob > 0.9: # keep only comments predicted as 1 with a probability of 0.9 or higher
  print(f"{i+1}\t{hyp}\t{prob}\t{line}")

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]



--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'accuracy': 0.4761904776096344}
--------------------------------------------------------------------------------
1	1	0.7130576372146606	0
2	0	0.8615444898605347	"Like I said in the video, subscribe if you haven’t already and you could win $10,000!"
3	1	0.8261112570762634	Me sub mrbeast
4	1	0.8039590120315552	you are so kind
5	0	0.8495272994041443	"10 ,000"
6	0	0.8527193665504456	"Can we just appreciate how much money he spends on his friends and family, and random ppl he doesn’t even know. He’s truly amazing"
7	0	0.8723955154418945	if people actually died in this everyone&#39;s reactions would be way more aggressive
8	0	0.8022157549858093	Here before 200M views!
9	1	0.9042181372642517	How the hell much did you spend!?
10	0	0.7927834987640381	"This game is so much fun, Jimmy never stops making popular videos"
11	0	0.5792055726051331	who is 001
12	1	0.9035159349441528	Awwww its fla
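The commented-out condition in the loop above hints at keeping only confident hate-speech predictions, and the output shows HTML entities such as `&#39;` left in the comment text. A possible post-processing step, sketched with the standard-library `html.unescape`; the function name `confident_hate_comments`, the 0.9 default and the sample rows are illustrative assumptions:

```python
import html

def confident_hate_comments(rows, threshold=0.9):
    """Keep rows predicted as label 1 with probability above `threshold`,
    decoding HTML entities in the text.

    Each row is a (predicted_label, probability, text) tuple."""
    return [
        (prob, html.unescape(text))
        for hyp, prob, text in rows
        if hyp == 1 and prob > threshold
    ]

# Invented rows for illustration only
rows = [
    (1, 0.95, "that&#39;s awful"),
    (1, 0.70, "meh"),
    (0, 0.99, "nice video"),
]
print(confident_hate_comments(rows))  # [(0.95, "that's awful")]
```

The same decoding could also be applied before tokenization, so that the model sees plain text rather than entity codes.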