
# PyTorch + HuggingFace
## KoELECTRA Model
Uses KoELECTRA-small by Jangwon Park<br>
https://monologg.kr/2020/05/02/koelectra-part1/<br>
https://github.com/monologg/KoELECTRA

## Dataset
Naver Sentiment Movie Corpus (NSMC), a dataset of Naver movie reviews<br>
https://github.com/e9t/nsmc

## References
- https://huggingface.co/transformers/training.html
- https://tutorials.pytorch.kr/beginner/data_loading_tutorial.html
- https://tutorials.pytorch.kr/beginner/blitz/cifar10_tutorial.html
- https://wikidocs.net/44249

## Note
Be sure to run this on a GPU: one epoch takes about 20 minutes.

In [1]:
# Install HuggingFace transformers and download the NSMC dataset
!pip install transformers
!wget https://raw.githubusercontent.com/e9t/nsmc/master/ratings_test.txt
!wget https://raw.githubusercontent.com/e9t/nsmc/master/ratings_train.txt

/bin/bash: line 1: pip: command not found
--2023-10-06 09:34:06--  https://raw.githubusercontent.com/e9t/nsmc/master/ratings_test.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4893335 (4.7M) [text/plain]
Saving to: ‘ratings_test.txt.1’


2023-10-06 09:34:06 (139 MB/s) - ‘ratings_test.txt.1’ saved [4893335/4893335]

--2023-10-06 09:34:06--  https://raw.githubusercontent.com/e9t/nsmc/master/ratings_train.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14628807 (14M) [text/plain]
Saving to: ‘ratings_train.txt.1’


2023-10-0

In [2]:
!head ratings_train.txt
!head ratings_test.txt

id	document	label
9976970	아 더빙.. 진짜 짜증나네요 목소리	0
3819312	흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나	1
10265843	너무재밓었다그래서보는것을추천한다	0
9045019	교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정	0
6483659	사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 던스트가 너무나도 이뻐보였다	1
5403919	막 걸음마 뗀 3세부터 초등학교 1학년생인 8살용영화.ㅋㅋㅋ...별반개도 아까움.	0
7797314	원작의 긴장감을 제대로 살려내지못했다.	0
9443947	별 반개도 아깝다 욕나온다 이응경 길용우 연기생활이몇년인지..정말 발로해도 그것보단 낫겟다 납치.감금만반복반복..이드라마는 가족도없다 연기못하는사람만모엿네	0
7156791	액션이 없는데도 재미 있는 몇안되는 영화	1
id	document	label
6270596	굳 ㅋ	1
9274899	GDNTOPCLASSINTHECLUB	0
8544678	뭐야 이 평점들은.... 나쁘진 않지만 10점 짜리는 더더욱 아니잖아	0
6825595	지루하지는 않은데 완전 막장임... 돈주고 보기에는....	0
6723715	3D만 아니었어도 별 다섯 개 줬을텐데.. 왜 3D로 나와서 제 심기를 불편하게 하죠??	0
7898805	음악이 주가 된, 최고의 음악영화	1
6315043	진정한 쓰레기	0
6097171	마치 미국애니에서 튀어나온듯한 창의력없는 로봇디자인부터가,고개를 젖게한다	0
8932678	갈수록 개판되가는 중국영화 유치하고 내용없음 폼잡다 끝남 말도안되는 무기에 유치한cg남무 아 그립다 동사서독같은 영화가 이건 3류아류작이다	0
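
The `head` output above shows the NSMC layout: a header row followed by tab-separated `id`, `document`, and `label` columns. As a minimal sketch of how such rows parse and clean up, the following uses only the standard library on a small hypothetical sample (the rows are made up for illustration and are not taken from the corpus). The notebook itself does the equivalent with pandas.

```python
import csv
import io

# Hypothetical sample in the NSMC layout: id <TAB> document <TAB> label.
sample = (
    "id\tdocument\tlabel\n"
    "1\tgreat movie\t1\n"
    "2\t\t0\n"             # empty review text, to be dropped
    "3\tboring plot\t0\n"
    "4\tgreat movie\t1\n"  # duplicate review text, to be dropped
)

def load_reviews(text):
    """Return (document, label) pairs, skipping empty and duplicate documents."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    seen = set()
    rows = []
    for record in reader:
        document = record["document"]
        if not document or document in seen:
            continue
        seen.add(document)
        rows.append((document, int(record["label"])))
    return rows

reviews = load_reviews(sample)
print(reviews)
```

Quoting is disabled here on the assumption that review text may contain stray quote characters, which would otherwise confuse the parser.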


In [3]:
import pandas as pd
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader, Dataset
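# transformers' own AdamW is deprecated; torch.optim.AdamW is the usual replacement
# (note that its weight_decay defaults to 0.01 rather than 0.0)
from torch.optim import AdamW
from transformers import AutoTokenizer, ElectraForSequenceClassification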
from tqdm.notebook import tqdm

In [4]:
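# Use the GPU when available (training on a CPU is impractically slow)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")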

# Build and Load the Dataset

In [5]:
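class NSMCDataset(Dataset):

  def __init__(self, csv_file):
    # Some rows contain NaN values, so drop them
    self.dataset = pd.read_csv(csv_file, sep='\t').dropna(axis=0)
    # Remove duplicate reviews
    self.dataset.drop_duplicates(subset=['document'], inplace=True)
    # NOTE: this is the small-v2 tokenizer, while the model created below is base-v3.
    # The two checkpoints appear to use different vocabularies, so matching them is likely preferable.
    self.tokenizer = AutoTokenizer.from_pretrained("monologg/koelectra-small-v2-discriminator")

    print(self.dataset.describe())

  def __len__(self):
    return len(self.dataset)

  def __getitem__(self, idx):
    # Columns 1 and 2 hold the review text ('document') and its label
    row = self.dataset.iloc[idx, 1:3].values
    text = row[0]
    y = row[1]

    inputs = self.tokenizer(
        text,
        return_tensors='pt',
        truncation=True,
        max_length=256,
        padding='max_length',  # replaces the deprecated pad_to_max_length=True
        add_special_tokens=True
        )

    # The tokenizer returns a batch of one, so take its single row
    input_ids = inputs['input_ids'][0]
    attention_mask = inputs['attention_mask'][0]

    return input_ids, attention_mask, y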
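
The tokenizer call above pads or truncates every review to `max_length=256` and returns an attention mask alongside the token ids. The following is a simplified sketch of that idea in plain Python, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and `pad_id=0` is assumed from the `padding_idx=0` shown in the model's embedding layer. A real tokenizer also keeps the closing special token when it truncates, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])  # truncation
    mask = [1] * len(ids)               # 1 marks a real token
    padding = max_length - len(ids)
    ids += [pad_id] * padding           # pad on the right
    mask += [0] * padding               # 0 marks padding
    return ids, mask

short_ids, short_mask = pad_or_truncate([2, 101, 102, 3], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every item has the same length, the default `DataLoader` collation can stack the items into one batch tensor without a custom `collate_fn`.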

In [6]:
train_dataset = NSMCDataset("ratings_train.txt")
test_dataset = NSMCDataset("ratings_test.txt")

                 id          label
count  1.461820e+05  146182.000000
mean   6.779186e+06       0.498283
std    2.919223e+06       0.499999
min    3.300000e+01       0.000000
25%    4.814832e+06       0.000000
50%    7.581160e+06       0.000000
75%    9.274760e+06       1.000000
max    1.027815e+07       1.000000
                 id         label
count  4.915700e+04  49157.000000
mean   6.752945e+06      0.502695
std    2.937158e+06      0.499998
min    6.010000e+02      0.000000
25%    4.777143e+06      0.000000
50%    7.565415e+06      1.000000
75%    9.260204e+06      1.000000
max    1.027809e+07      1.000000


# Create Model

In [7]:
model = ElectraForSequenceClassification.from_pretrained("monologg/koelectra-base-v3-discriminator").to(device)

# Try a single forward pass
# text, attention_mask, y = train_dataset[0]
# model(text.unsqueeze(0).to(device), attention_mask=attention_mask.unsqueeze(0).to(device))

Some weights of the model checkpoint at monologg/koelectra-base-v3-discriminator were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.weight']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at monologg/koelectra-base-v3-discriminator and are newly initialized: 

In [8]:
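# Optionally resume from a checkpoint saved by an earlier run.
# On a first run model.pt does not exist yet, so this raises FileNotFoundError.
model.load_state_dict(torch.load("model.pt"))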

FileNotFoundError: [Errno 2] No such file or directory: 'model.pt'
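
The `FileNotFoundError` above is what a first run produces, because `model.pt` is only written at the end of training. One way to make the resume step optional is to check for the file first. The sketch below assumes a hypothetical helper, `load_checkpoint_if_present`, and uses a small `torch.nn.Linear` as a stand-in so that it runs without downloading KoELECTRA.

```python
import os
import tempfile

import torch

def load_checkpoint_if_present(model, path, device="cpu"):
    """Load weights from path when the file exists; report whether it did."""
    if not os.path.exists(path):
        return False
    state_dict = torch.load(path, map_location=device)
    model.load_state_dict(state_dict)
    return True

# Stand-in model so the sketch runs without downloading KoELECTRA.
demo_model = torch.nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint_path = os.path.join(tmp_dir, "model.pt")
    loaded_before = load_checkpoint_if_present(demo_model, checkpoint_path)
    torch.save(demo_model.state_dict(), checkpoint_path)
    loaded_after = load_checkpoint_if_present(demo_model, checkpoint_path)

print(loaded_before, loaded_after)
```

Passing `map_location` lets a checkpoint saved on a GPU load on a CPU-only machine as well.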

In [9]:
# Inspect the model layers
model

ElectraForSequenceClassification(
  (electra): ElectraModel(
    (embeddings): ElectraEmbeddings(
      (word_embeddings): Embedding(35000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): ElectraEncoder(
      (layer): ModuleList(
        (0-11): 12 x ElectraLayer(
          (attention): ElectraAttention(
            (self): ElectraSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): ElectraSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): L

# Learn

In [10]:
epochs = 5
batch_size = 16

In [11]:
optimizer = AdamW(model.parameters(), lr=5e-6)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
# Evaluation does not depend on sample order, so the test set is not shuffled
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
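
The number of steps per epoch follows from the dataset sizes and the batch size. As a quick check, assuming the row counts printed by `describe()` (146,182 training rows and 49,157 test rows after cleaning) and the `DataLoader` default of keeping the final partial batch:

```python
import math

batch_size = 16
train_rows = 146182  # row count printed by describe() for the training set
test_rows = 49157    # row count printed by describe() for the test set

# The last, smaller batch is kept, so the count is rounded up.
train_batches = math.ceil(train_rows / batch_size)
test_batches = math.ceil(test_rows / batch_size)
print(train_batches, test_batches)  # 9137 3073
```

These values match the totals that the `tqdm` progress bars report for the training and test loops.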



In [12]:
losses = []
accuracies = []

for i in range(epochs):
  total_loss = 0.0  # running sum of per-batch losses within this epoch
  correct = 0
  total = 0
  batches = 0

  model.train()

  for input_ids_batch, attention_masks_batch, y_batch in tqdm(train_loader):
    optimizer.zero_grad()
    y_batch = y_batch.to(device)
    # [0] selects the logits from the model output
    y_pred = model(input_ids_batch.to(device),
                   attention_mask=attention_masks_batch.to(device))[0]
    loss = F.cross_entropy(y_pred, y_batch)
    loss.backward()
    optimizer.step()

    total_loss += loss.item()

    # The predicted class is the index of the largest logit
    _, predicted = torch.max(y_pred, 1)
    correct += (predicted == y_batch).sum()
    total += len(y_batch)

    batches += 1
    if batches % 100 == 0:
      # "Batch Loss" is the cumulative loss so far in this epoch, not one batch's loss
      print("Batch Loss:", total_loss, "Accuracy:", correct.float() / total)

  losses.append(total_loss)
  accuracies.append(correct.float() / total)
  print("Train Loss:", total_loss, "Accuracy:", correct.float() / total)

  0%|          | 0/9137 [00:00<?, ?it/s]



Batch Loss: 69.39986997842789 Accuracy: tensor(0.5100, device='cuda:0')
Batch Loss: 138.429516851902 Accuracy: tensor(0.5241, device='cuda:0')
Batch Loss: 207.1890783905983 Accuracy: tensor(0.5325, device='cuda:0')
Batch Loss: 274.6321071386337 Accuracy: tensor(0.5469, device='cuda:0')
Batch Loss: 340.98795729875565 Accuracy: tensor(0.5605, device='cuda:0')
Batch Loss: 405.82134771347046 Accuracy: tensor(0.5726, device='cuda:0')
Batch Loss: 469.6871585249901 Accuracy: tensor(0.5813, device='cuda:0')
Batch Loss: 530.7550293803215 Accuracy: tensor(0.5917, device='cuda:0')
Batch Loss: 591.7895978093147 Accuracy: tensor(0.6016, device='cuda:0')
Batch Loss: 653.2738831937313 Accuracy: tensor(0.6074, device='cuda:0')
Batch Loss: 713.646975338459 Accuracy: tensor(0.6144, device='cuda:0')
Batch Loss: 773.4411258995533 Accuracy: tensor(0.6196, device='cuda:0')
Batch Loss: 832.0217029154301 Accuracy: tensor(0.6260, device='cuda:0')
Batch Loss: 889.4490215778351 Accuracy: tensor(0.6314, device='c

  0%|          | 0/9137 [00:00<?, ?it/s]

Batch Loss: 45.865626722574234 Accuracy: tensor(0.7738, device='cuda:0')
Batch Loss: 86.68280711025 Accuracy: tensor(0.7881, device='cuda:0')
Batch Loss: 126.58210725337267 Accuracy: tensor(0.7938, device='cuda:0')
Batch Loss: 169.25514512509108 Accuracy: tensor(0.7937, device='cuda:0')
Batch Loss: 210.72226057201624 Accuracy: tensor(0.7954, device='cuda:0')
Batch Loss: 253.67399633675814 Accuracy: tensor(0.7967, device='cuda:0')
Batch Loss: 296.3331767693162 Accuracy: tensor(0.7969, device='cuda:0')
Batch Loss: 339.14958999305964 Accuracy: tensor(0.7960, device='cuda:0')
Batch Loss: 381.9887008443475 Accuracy: tensor(0.7943, device='cuda:0')
Batch Loss: 422.2047255858779 Accuracy: tensor(0.7957, device='cuda:0')
Batch Loss: 463.93116687983274 Accuracy: tensor(0.7970, device='cuda:0')
Batch Loss: 502.8370853289962 Accuracy: tensor(0.7988, device='cuda:0')
Batch Loss: 544.748451538384 Accuracy: tensor(0.7988, device='cuda:0')
Batch Loss: 584.5420909449458 Accuracy: tensor(0.8006, device

  0%|          | 0/9137 [00:00<?, ?it/s]

Batch Loss: 34.10570240393281 Accuracy: tensor(0.8431, device='cuda:0')
Batch Loss: 70.9146230109036 Accuracy: tensor(0.8384, device='cuda:0')
Batch Loss: 107.15166762843728 Accuracy: tensor(0.8404, device='cuda:0')
Batch Loss: 144.84247405454516 Accuracy: tensor(0.8380, device='cuda:0')
Batch Loss: 178.96357786282897 Accuracy: tensor(0.8396, device='cuda:0')
Batch Loss: 210.79683342203498 Accuracy: tensor(0.8443, device='cuda:0')
Batch Loss: 247.87752669677138 Accuracy: tensor(0.8425, device='cuda:0')
Batch Loss: 282.7382583282888 Accuracy: tensor(0.8432, device='cuda:0')
Batch Loss: 317.3107353262603 Accuracy: tensor(0.8438, device='cuda:0')
Batch Loss: 353.2925144918263 Accuracy: tensor(0.8429, device='cuda:0')
Batch Loss: 388.45303281769156 Accuracy: tensor(0.8430, device='cuda:0')
Batch Loss: 423.2986331395805 Accuracy: tensor(0.8428, device='cuda:0')
Batch Loss: 457.9618362374604 Accuracy: tensor(0.8425, device='cuda:0')
Batch Loss: 491.7219005562365 Accuracy: tensor(0.8428, devi

  0%|          | 0/9137 [00:00<?, ?it/s]

Batch Loss: 32.44971141964197 Accuracy: tensor(0.8537, device='cuda:0')
Batch Loss: 62.2428475022316 Accuracy: tensor(0.8656, device='cuda:0')
Batch Loss: 92.94741145521402 Accuracy: tensor(0.8638, device='cuda:0')
Batch Loss: 123.07550092041492 Accuracy: tensor(0.8653, device='cuda:0')
Batch Loss: 154.30858182907104 Accuracy: tensor(0.8640, device='cuda:0')
Batch Loss: 185.1712251007557 Accuracy: tensor(0.8635, device='cuda:0')
Batch Loss: 217.05486054718494 Accuracy: tensor(0.8629, device='cuda:0')
Batch Loss: 250.97056497633457 Accuracy: tensor(0.8613, device='cuda:0')
Batch Loss: 279.75236500799656 Accuracy: tensor(0.8627, device='cuda:0')
Batch Loss: 311.52264492213726 Accuracy: tensor(0.8628, device='cuda:0')
Batch Loss: 342.7262018471956 Accuracy: tensor(0.8634, device='cuda:0')
Batch Loss: 374.7665247730911 Accuracy: tensor(0.8635, device='cuda:0')
Batch Loss: 403.5080112479627 Accuracy: tensor(0.8643, device='cuda:0')
Batch Loss: 434.68556926772 Accuracy: tensor(0.8642, device

  0%|          | 0/9137 [00:00<?, ?it/s]

Batch Loss: 26.95047152042389 Accuracy: tensor(0.8800, device='cuda:0')
Batch Loss: 56.24812461063266 Accuracy: tensor(0.8762, device='cuda:0')
Batch Loss: 85.06918613612652 Accuracy: tensor(0.8769, device='cuda:0')
Batch Loss: 114.6578822080046 Accuracy: tensor(0.8756, device='cuda:0')
Batch Loss: 145.65880560316145 Accuracy: tensor(0.8735, device='cuda:0')
Batch Loss: 175.09387567080557 Accuracy: tensor(0.8727, device='cuda:0')
Batch Loss: 202.03135455586016 Accuracy: tensor(0.8746, device='cuda:0')
Batch Loss: 229.0541551131755 Accuracy: tensor(0.8757, device='cuda:0')
Batch Loss: 256.63424874283373 Accuracy: tensor(0.8753, device='cuda:0')
Batch Loss: 284.80218826793134 Accuracy: tensor(0.8766, device='cuda:0')
Batch Loss: 310.2028466183692 Accuracy: tensor(0.8777, device='cuda:0')
Batch Loss: 336.632358180359 Accuracy: tensor(0.8786, device='cuda:0')
Batch Loss: 361.70219078846276 Accuracy: tensor(0.8800, device='cuda:0')
Batch Loss: 389.63325539417565 Accuracy: tensor(0.8804, dev

In [13]:
losses, accuracies

([4651.6763158887625,
  3617.4758188501,
  3179.780110117048,
  2830.5081308037043,
  2554.9124857839197],
 [tensor(0.7406, device='cuda:0'),
  tensor(0.8154, device='cuda:0'),
  tensor(0.8438, device='cuda:0'),
  tensor(0.8645, device='cuda:0'),
  tensor(0.8806, device='cuda:0')])
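
The values in `losses` are sums over every batch in an epoch, so they are easier to read once divided by the number of batches. A small sketch, assuming 9,137 batches per epoch (the total shown by the `tqdm` progress bar):

```python
import math

batches_per_epoch = 9137  # total shown by the tqdm progress bar
epoch_loss_sums = [
    4651.6763158887625,
    3617.4758188501,
    3179.780110117048,
    2830.5081308037043,
    2554.9124857839197,
]

mean_losses = [loss_sum / batches_per_epoch for loss_sum in epoch_loss_sums]
for epoch, mean_loss in enumerate(mean_losses, start=1):
    print(f"epoch {epoch}: mean loss per batch = {mean_loss:.4f}")

# Chance-level cross-entropy for two balanced classes, for comparison.
print(f"ln(2) = {math.log(2):.4f}")
```

The mean loss falls from roughly 0.51 in the first epoch to roughly 0.28 in the last, starting from a value near ln(2) in the first hundred batches, which is what an untrained binary classifier would be expected to score.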

Check accuracy on the test dataset

In [14]:
model.eval()

test_correct = 0
test_total = 0

# Gradients are not needed for evaluation; disabling them saves memory and time
with torch.no_grad():
  for input_ids_batch, attention_masks_batch, y_batch in tqdm(test_loader):
    y_batch = y_batch.to(device)
    y_pred = model(input_ids_batch.to(device), attention_mask=attention_masks_batch.to(device))[0]
    _, predicted = torch.max(y_pred, 1)
    test_correct += (predicted == y_batch).sum()
    test_total += len(y_batch)

print("Accuracy:", test_correct.float() / test_total)

  0%|          | 0/3073 [00:00<?, ?it/s]

Accuracy: tensor(0.8505, device='cuda:0')
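
The accuracy above comes from taking the index of the largest logit per row and comparing it with the label. A minimal sketch of that step on hypothetical logits (the numbers are made up for illustration, not produced by the model):

```python
import torch

# Hypothetical logits for four reviews and two classes (0 = negative, 1 = positive).
logits = torch.tensor([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]])
labels = torch.tensor([0, 1, 0, 0])

with torch.no_grad():
    # torch.max along dim 1 returns (values, indices); the indices are the predicted classes.
    _, predicted = torch.max(logits, 1)
    correct = (predicted == labels).sum().item()

accuracy = correct / len(labels)
print(predicted.tolist(), accuracy)  # [0, 1, 1, 0] 0.75
```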


In [15]:
# Save the model
torch.save(model.state_dict(), "model.pt")

### Applying the Model

In [1]:
import pandas as pd
import torch
from torch.nn import functional as F
from transformers import AutoTokenizer, ElectraForSequenceClassification


In [2]:
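# Use the GPU when available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pretrained model, then the fine-tuned weights saved during training
model = ElectraForSequenceClassification.from_pretrained("monologg/koelectra-base-v3-discriminator").to(device)
model.load_state_dict(torch.load("model.pt", map_location=device))
model.eval()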

Some weights of the model checkpoint at monologg/koelectra-base-v3-discriminator were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense.weight']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at monologg/koelectra-base-v3-discriminator and are newly initialized: 

ElectraForSequenceClassification(
  (electra): ElectraModel(
    (embeddings): ElectraEmbeddings(
      (word_embeddings): Embedding(35000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): ElectraEncoder(
      (layer): ModuleList(
        (0-11): 12 x ElectraLayer(
          (attention): ElectraAttention(
            (self): ElectraSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): ElectraSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): L

In [6]:
!ls

'Project_2023년 강서구청장 보궐선거 예측분석(영화리뷰기반모델).ipynb'
 model.pt
 ratings_test.txt
 ratings_test.txt.1
 ratings_train.txt
 ratings_train.txt.1
 results_권수정.csv
 results_김태우.csv
 results_진교훈.csv


In [16]:
# Load each candidate's comment CSV into a DataFrame
df_jin = pd.read_csv("results_진교훈.csv")
df_kim = pd.read_csv("results_김태우.csv")
df_kwon = pd.read_csv("results_권수정.csv")

In [17]:
# Set up the tokenizer (the same checkpoint the training Dataset used)
tokenizer = AutoTokenizer.from_pretrained("monologg/koelectra-small-v2-discriminator")

In [18]:
# Define the comment prediction function
def predict_comments(df):
    # "댓글" is the comment-text column of the CSV
    inputs = tokenizer(df["댓글"].tolist(), return_tensors='pt', truncation=True, max_length=256, padding='max_length', add_special_tokens=True).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    # Softmax turns logits into probabilities; it does not change which class is largest
    predictions = F.softmax(outputs.logits, dim=1)
    _, predicted_labels = torch.max(predictions, 1)
    return predicted_labels.cpu().tolist()
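
`predict_comments` tokenizes every comment and runs them through the model in a single forward pass. That works for files of roughly a thousand comments, but larger inputs may exhaust GPU memory. A minimal sketch of splitting the work into mini-batches follows; `iter_batches`, `predict_in_batches`, and `fake_predict` are hypothetical names, and `fake_predict` is only a stand-in for the tokenizer and model call.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def predict_in_batches(texts, predict_fn, batch_size=32):
    """Apply predict_fn to each batch and concatenate the label lists."""
    labels = []
    for batch in iter_batches(texts, batch_size):
        labels.extend(predict_fn(batch))
    return labels

# Hypothetical stand-in for the tokenizer + model call: labels a text 1 if it is long.
def fake_predict(batch):
    return [1 if len(text) > 3 else 0 for text in batch]

comments = ["good", "bad", "so-so", "ok", "great"]
batch_sizes = [len(batch) for batch in iter_batches(comments, 2)]
labels = predict_in_batches(comments, fake_predict, batch_size=2)
print(batch_sizes, labels)  # [2, 2, 1] [1, 0, 1, 0, 1]
```

In the notebook, `predict_fn` would wrap the tokenizer call and `model(**inputs)` for one batch of comment strings.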

In [19]:
# Predict the sentiment of the comments in each DataFrame
predictions_jin = predict_comments(df_jin)
predictions_kim = predict_comments(df_kim)
predictions_kwon = predict_comments(df_kwon)

# Print the predictions ("댓글 예측 결과" means "comment prediction results")
print("Jin 댓글 예측 결과:", predictions_jin)
print("Kim 댓글 예측 결과:", predictions_kim)
print("Kwon 댓글 예측 결과:", predictions_kwon)



Jin 댓글 예측 결과: [1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1

In [20]:
# Define a function that converts numeric predictions to "긍정" (positive) or "부정" (negative)
def convert_to_sentiment(predictions):
    # Index 0 -> "부정" (negative), index 1 -> "긍정" (positive), matching the NSMC labels
    sentiment_labels = ["부정", "긍정"]
    return [sentiment_labels[prediction] for prediction in predictions]

# Convert each candidate's predictions to sentiment labels
sentiments_jin = convert_to_sentiment(predictions_jin)
sentiments_kim = convert_to_sentiment(predictions_kim)
sentiments_kwon = convert_to_sentiment(predictions_kwon)
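
The prediction lists are long, so a per-class share is often easier to compare across candidates than the raw lists. The sketch below assumes a hypothetical helper, `sentiment_share`, and uses made-up predictions for illustration rather than the notebook's real output.

```python
from collections import Counter

def sentiment_share(predictions):
    """Return the fraction of predictions per class (0 = negative, 1 = positive)."""
    total = len(predictions)
    if total == 0:
        return {0: 0.0, 1: 0.0}
    counts = Counter(predictions)
    return {label: counts.get(label, 0) / total for label in (0, 1)}

# Hypothetical predictions for illustration only, not the notebook's real output.
example_predictions = [1, 1, 0, 1, 0, 0, 1, 1]
share = sentiment_share(example_predictions)
print(share)  # {0: 0.375, 1: 0.625}
```

Applied to the notebook's data, the same function could be called on `predictions_jin`, `predictions_kim`, and `predictions_kwon`.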

In [21]:
# Add the predictions to each DataFrame as the "예측 감정" (predicted sentiment) column
df_jin["예측 감정"] = sentiments_jin
df_kim["예측 감정"] = sentiments_kim
df_kwon["예측 감정"] = sentiments_kwon

Jin 데이터프레임:
                                                동영상 제목                   게시일  \
0    [시선집중] 진교훈 강서구청장 후보에게 듣다 - 진교훈 더불어민주당 강서구청장 후보...  2023-09-25T01:45:14Z   
1    [시선집중] 진교훈 강서구청장 후보에게 듣다 - 진교훈 더불어민주당 강서구청장 후보...  2023-09-25T01:45:14Z   
2    [시선집중] 진교훈 강서구청장 후보에게 듣다 - 진교훈 더불어민주당 강서구청장 후보...  2023-09-25T01:45:14Z   
3    [시선집중] 진교훈 강서구청장 후보에게 듣다 - 진교훈 더불어민주당 강서구청장 후보...  2023-09-25T01:45:14Z   
4    [시선집중] 진교훈 강서구청장 후보에게 듣다 - 진교훈 더불어민주당 강서구청장 후보...  2023-09-25T01:45:14Z   
..                                                 ...                   ...   
913     서울 강서구청장 보궐선거 사전투표 D-1..진교훈 민주당 후보 유세 현장 / SBS  2023-10-05T13:54:15Z   
914     서울 강서구청장 보궐선거 사전투표 D-1..진교훈 민주당 후보 유세 현장 / SBS  2023-10-05T13:54:15Z   
915     서울 강서구청장 보궐선거 사전투표 D-1..진교훈 민주당 후보 유세 현장 / SBS  2023-10-05T13:54:15Z   
916     서울 강서구청장 보궐선거 사전투표 D-1..진교훈 민주당 후보 유세 현장 / SBS  2023-10-05T13:54:15Z   
917     서울 강서구청장 보궐선거 사전투표 D-1..진교훈 민주당 후보 유세 현장 / SBS  2023-10-05T13:54:15Z   

     영상 좋아요 수              

In [23]:
df_jin

Unnamed: 0,동영상 제목,게시일,영상 좋아요 수,댓글,작성자,댓글 작성일,댓글 좋아요 수,예측 감정
0,[시선집중] 진교훈 강서구청장 후보에게 듣다 - 진교훈 더불어민주당 강서구청장 후보...,2023-09-25T01:45:14Z,651,이죄명을 위한\n투표는\n나라를 망치는일이다\n김태우 후보님이\n꼭 승리해야...,점순 조,2023-10-09T00:00:18Z,1,긍정
1,[시선집중] 진교훈 강서구청장 후보에게 듣다 - 진교훈 더불어민주당 강서구청장 후보...,2023-09-25T01:45:14Z,651,김에게 정의를 보여줍시다.진교훈님 당선 응원합니다,dolma chiring,2023-10-07T01:15:17Z,0,긍정
2,[시선집중] 진교훈 강서구청장 후보에게 듣다 - 진교훈 더불어민주당 강서구청장 후보...,2023-09-25T01:45:14Z,651,진교훈 응원합니다,dolma chiring,2023-10-07T01:13:38Z,0,긍정
3,[시선집중] 진교훈 강서구청장 후보에게 듣다 - 진교훈 더불어민주당 강서구청장 후보...,2023-09-25T01:45:14Z,651,진교훈후보님. 승리바랍니다. 힘차게응원하겠습니다,혜숙 이,2023-10-06T03:06:31Z,0,긍정
4,[시선집중] 진교훈 강서구청장 후보에게 듣다 - 진교훈 더불어민주당 강서구청장 후보...,2023-09-25T01:45:14Z,651,응원할게요 꼭승리합니다,오 상훈,2023-10-05T15:06:49Z,0,긍정
...,...,...,...,...,...,...,...,...
913,서울 강서구청장 보궐선거 사전투표 D-1..진교훈 민주당 후보 유세 현장 / SBS,2023-10-05T13:54:15Z,66,내년 총선 선거 민주당 인간들 선거하지 않습니다. 우 👎🏻👎🏻...,이순임,2023-10-05T18:20:25Z,0,긍정
914,서울 강서구청장 보궐선거 사전투표 D-1..진교훈 민주당 후보 유세 현장 / SBS,2023-10-05T13:54:15Z,66,이겻다진교훈강서구청장당선을추카함다!?~,가나다,2023-10-05T14:58:58Z,3,부정
915,서울 강서구청장 보궐선거 사전투표 D-1..진교훈 민주당 후보 유세 현장 / SBS,2023-10-05T13:54:15Z,66,"김태우는 공약을 국가 예산을 끌어 올 수 있다, 대통령한테 전화만 하면 된다. 서울...",Lee Park,2023-10-05T14:24:48Z,4,부정
916,서울 강서구청장 보궐선거 사전투표 D-1..진교훈 민주당 후보 유세 현장 / SBS,2023-10-05T13:54:15Z,66,봉투당은 절대 안됩니다\n나라를 망치는 문이기에,할렐루야김,2023-10-05T13:57:26Z,0,부정


In [24]:
df_kim

Unnamed: 0,동영상 제목,게시일,영상 좋아요 수,댓글,작성자,댓글 작성일,댓글 좋아요 수,예측 감정
0,"사면된 김태우, 다시 강서구청장?‥""사법부에 대한 정면도전"" (2023.08.14/...",2023-08-14T11:25:02Z,3334,이런 미친짓을 일삼는 윤석렬은 기본이 안된 양아치 같은 놈이네. 검사를 했다는 인간...,munki lee,2023-10-09T22:02:05Z,0,부정
1,"사면된 김태우, 다시 강서구청장?‥""사법부에 대한 정면도전"" (2023.08.14/...",2023-08-14T11:25:02Z,3334,미친것들 그렇게도 인물이 없나? 김태우를 다시 쓰게? 유권자들이 정신 차...,김오순,2023-10-09T02:10:34Z,0,부정
2,"사면된 김태우, 다시 강서구청장?‥""사법부에 대한 정면도전"" (2023.08.14/...",2023-08-14T11:25:02Z,3334,야참한심하고. 짝이업네! 구치소에서나온사람이. 강서구출마. 한다. 국민...,완태 남,2023-10-08T06:08:17Z,0,부정
3,"사면된 김태우, 다시 강서구청장?‥""사법부에 대한 정면도전"" (2023.08.14/...",2023-08-14T11:25:02Z,3334,민주주의 삼권분립을 찍어 누르는 ㄷㅅ들,박근영,2023-10-08T04:29:22Z,0,부정
4,"사면된 김태우, 다시 강서구청장?‥""사법부에 대한 정면도전"" (2023.08.14/...",2023-08-14T11:25:02Z,3334,또 뽑으면 진짜 ㅋㅋㅋㅋㅋㅋ 답없다 ㅋㅋㅋㅋ 국민 수준임,H,2023-10-08T00:18:04Z,0,부정
...,...,...,...,...,...,...,...,...
981,"[생중계] 김태우 국민의힘 강서구청장 후보, 총력유세! (2023.10.05 오후)",2023-10-05T10:09:51Z,89,무뇌아집결?,심플한늑대,2023-10-09T16:17:48Z,0,부정
982,"[생중계] 김태우 국민의힘 강서구청장 후보, 총력유세! (2023.10.05 오후)",2023-10-05T10:09:51Z,89,조원진\n지구를 떠나거라\n\n박근혜 사진걸고\n연명하더니 \nㅈ됐네,김귀삼,2023-10-09T16:07:41Z,0,부정
983,"[생중계] 김태우 국민의힘 강서구청장 후보, 총력유세! (2023.10.05 오후)",2023-10-05T10:09:51Z,89,불쌍한 국민의 힘\n뉴라이트 집단\n정광훈 똘아이 \n광신도만. 남았네\nㅉ ㅉ,김귀삼,2023-10-09T16:04:46Z,0,부정
984,"[생중계] 김태우 국민의힘 강서구청장 후보, 총력유세! (2023.10.05 오후)",2023-10-05T10:09:51Z,89,어디 촌동네 가서 유세하신건가요? 뭔 사람이 일케없어 ㅋㅋ,유은호,2023-10-09T16:01:55Z,2,부정


In [28]:
df_jin.to_csv('predictions_jin.csv', index=False, encoding='utf-8')
df_kim.to_csv('predictions_kim.csv', index=False, encoding='utf-8')
df_kwon.to_csv('predictions_kwon.csv', index=False, encoding='utf-8')