PyTorch prerequisites:
* [Tensors in PyTorch](https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/tensor_tutorial.ipynb)
* [Automatic differentiation and what `.backward()` does](https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/autograd_tutorial.ipynb)
* [A very simple neural network in PyTorch](https://colab.research.google.com/drive/1RsZvw4KBGn5U5Aj5Ak7OG2pHx6z1OSlF)

# Text classification

## Fakenews

1. We will work with the fake news data from here: https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv
2. Preprocess the text. Split the data into train and test sets for the classification task.
3. Vectorize the texts.
4. Train a classification algorithm on the resulting vectors.

We have already seen how this task is solved with Word2Vec. Let's recap.

# 1. Loading libraries

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import optuna
from copy import deepcopy

from gensim.models.word2vec import Word2Vec
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, auc, classification_report, accuracy_score

import torch
import torch.nn as nn
import torch.optim as optim

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\79111\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# 2. Loading and preparing the data

In [2]:
# https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv
df = pd.read_csv('Constraint_Train.csv')
df.head()

Unnamed: 0,id,tweet,label
0,1,The CDC currently reports 99031 deaths. In gen...,real
1,2,States reported 1121 deaths a small rise from ...,real
2,3,Politically Correct Woman (Almost) Uses Pandem...,fake
3,4,#IndiaFightsCorona: We have 1524 #COVID testin...,real
4,5,Populous states can generate large case counts...,real


## 2.1 Data based on word embeddings

In [3]:
sentences = [word_tokenize(text.lower()) for text in tqdm(df.tweet)]
model_tweets = Word2Vec(sentences, workers=1, vector_size=300, min_count=3, window=5, epochs=50)
model_tweets.wv.most_similar('vaccine')

100%|██████████| 6420/6420 [00:02<00:00, 2574.51it/s]


[('vaccines', 0.5229357481002808),
 ('drug', 0.4898321330547333),
 ('bharat', 0.42307907342910767),
 ('biotech', 0.42062073945999146),
 ('cure', 0.41453468799591064),
 ('trials', 0.3973405659198761),
 ('therapeutics', 0.3923570215702057),
 ('priority', 0.38474297523498535),
 ('developed', 0.3773641884326935),
 ('trial', 0.37083232402801514)]

In [4]:
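def get_text_embedding(text):
    # Sum the Word2Vec vectors of all in-vocabulary tokens of one text.
    result = []
    for word in text:
        if word in model_tweets.wv:
            result.append(model_tweets.wv[word])
    if len(result):
        return np.sum(result, axis=0)
    else:
        # No known tokens: fall back to a zero vector of the embedding size.
        return np.zeros(300)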
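
The function above sums the word vectors, so the norm of a text vector grows with the number of in-vocabulary tokens. The sketch below contrasts sum pooling with mean pooling on toy data. The helper name `pool_embeddings` and the 3-dimensional toy vectors are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def pool_embeddings(vectors, mode="sum", dim=3):
    """Pool a list of word vectors into one text vector (hypothetical helper)."""
    if len(vectors) == 0:
        # Same fallback as in the notebook: a zero vector for empty input.
        return np.zeros(dim)
    stacked = np.stack(vectors)
    if mode == "sum":
        return stacked.sum(axis=0)
    if mode == "mean":
        return stacked.mean(axis=0)
    raise ValueError(f"unknown mode: {mode}")

# Toy data: the same word repeated once and four times.
word = np.array([1.0, 2.0, 3.0])
short_text = [word]
long_text = [word] * 4

sum_short = pool_embeddings(short_text, "sum")
sum_long = pool_embeddings(long_text, "sum")
mean_short = pool_embeddings(short_text, "mean")
mean_long = pool_embeddings(long_text, "mean")

print("sum :", sum_short, sum_long)    # the long text has a 4x larger vector
print("mean:", mean_short, mean_long)  # both texts map to the same vector
```

Whether sum or mean pooling works better for this dataset is an empirical question; the sketch only shows how the two differ.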

In [5]:
features = [get_text_embedding(text) for text in sentences]
features = np.array(features)

labels = (df.label == 'real').astype(int).to_list()
labels = np.array(labels)

X_train_e, X_test_e, y_train_e, y_test_e = train_test_split(features, labels, test_size=0.33, stratify=labels, random_state=42)

## 2.2 Data based on bag of words

In [6]:
vec = CountVectorizer()
bow = vec.fit_transform(df.tweet)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(bow, labels, test_size=0.33, stratify=labels, random_state=42)
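
The cell above fits `CountVectorizer` on the full corpus before splitting, so the vocabulary also contains words that occur only in test tweets. A common alternative, sketched here on toy data, is to fit the vectorizer on the training split and only transform the test split. The toy texts and variable names are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus standing in for df.tweet (illustrative only).
texts = [
    "vaccine trial reports new results",
    "miracle cure found in kitchen",
    "states reported new deaths today",
    "secret drug hidden by doctors",
    "testing labs expanded this week",
    "garlic water cures the virus",
    "hospital admissions are falling",
    "masks cause oxygen loss claim",
]
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0]

train_texts, test_texts, y_tr, y_te = train_test_split(
    texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=42
)

vectorizer = CountVectorizer()
X_tr = vectorizer.fit_transform(train_texts)  # vocabulary learned from train only
X_te = vectorizer.transform(test_texts)       # unseen test words are ignored

print(X_tr.shape, X_te.shape)
```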

# 3. scikit-learn

## 3.1 Logistic regression with hyperparameter tuning

In [7]:
def objective_logreg(trial):
    param = {
        'C': trial.suggest_float('C', 0.001, 1.),
        'solver': trial.suggest_categorical('solver', ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga']),
        'max_iter': 1000
    }
    if param['solver'] in ['lbfgs', 'newton-cg', 'newton-cholesky', 'sag']:
        param['penalty'] = 'l2'
    elif param['solver'] in ['liblinear', 'saga']:
        param['penalty'] = trial.suggest_categorical('penalty', ['l1', 'l2'])
    
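    # X_train_cv, X_test_cv, y_train_cv and y_test_cv are globals set before each study.
    model_cv = LogisticRegression(**param)
    model_cv.fit(X_train_cv, y_train_cv)
    
    # Score the trial by the area under the precision-recall curve on the validation split.
    preds = model_cv.predict_proba(X_test_cv)[:, 1]
    precision, recall, thresholds = precision_recall_curve(y_test_cv, preds)
    pr_auc = auc(recall, precision)
    return pr_auc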
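
The objective returns the area under the precision-recall curve. A minimal sketch of that computation on toy scores follows; the labels and scores are made up for illustration. The positives are ranked strictly above the negatives, so the area should come out at 1.0.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

# Toy validation labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.8, 0.9])

# precision_recall_curve sweeps the decision threshold over the scores.
precision, recall, thresholds = precision_recall_curve(y_true, scores)

# auc integrates precision over recall with the trapezoidal rule.
pr_auc = auc(recall, precision)
print(pr_auc)
```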

### 3.1.1 Data based on word embeddings

In [13]:
X_train_cv, X_test_cv, y_train_cv, y_test_cv = train_test_split(X_train_e, y_train_e, test_size=0.2, stratify=y_train_e,
                                                                random_state=42)
study = optuna.create_study(direction='maximize')
study.optimize(objective_logreg, n_trials=50)

[I 2024-01-10 12:54:54,223] A new study created in memory with name: no-name-140f1b07-785b-4da9-9acc-9228a08e606c
[I 2024-01-10 12:54:54,343] Trial 0 finished with value: 0.9821521533545223 and parameters: {'C': 0.44520073827364903, 'solver': 'newton-cholesky'}. Best is trial 0 with value: 0.9821521533545223.
[I 2024-01-10 12:54:55,984] Trial 1 finished with value: 0.9822987906890318 and parameters: {'C': 0.9008443314477318, 'solver': 'liblinear', 'penalty': 'l2'}. Best is trial 1 with value: 0.9822987906890318.
[I 2024-01-10 12:54:56,069] Trial 2 finished with value: 0.9827917359575997 and parameters: {'C': 0.057477884076623194, 'solver': 'newton-cholesky'}. Best is trial 2 with value: 0.9827917359575997.
[I 2024-01-10 12:55:06,808] Trial 3 finished with value: 0.9830562882931673 and parameters: {'C': 0.5798986114854214, 'solver': 'sag'}. Best is trial 3 with value: 0.9830562882931673.
[I 2024-01-10 12:55:08,627] Trial 4 finished with value: 0.9822159633498113 and parameters: {'C': 0.

[I 2024-01-10 12:59:46,705] Trial 42 finished with value: 0.9830651743967154 and parameters: {'C': 0.5888041315842195, 'solver': 'sag'}. Best is trial 10 with value: 0.9833400601471607.
[I 2024-01-10 12:59:57,240] Trial 43 finished with value: 0.983062778687961 and parameters: {'C': 0.6067119821506444, 'solver': 'sag'}. Best is trial 10 with value: 0.9833400601471607.
[I 2024-01-10 12:59:57,795] Trial 44 finished with value: 0.9821615939495832 and parameters: {'C': 0.7023491404234278, 'solver': 'newton-cg'}. Best is trial 10 with value: 0.9833400601471607.
[I 2024-01-10 13:00:03,617] Trial 45 finished with value: 0.983062778687961 and parameters: {'C': 0.656892765762819, 'solver': 'sag'}. Best is trial 10 with value: 0.9833400601471607.
[I 2024-01-10 13:00:05,824] Trial 46 finished with value: 0.9825609391796143 and parameters: {'C': 0.4483345450884514, 'solver': 'liblinear', 'penalty': 'l1'}. Best is trial 10 with value: 0.9833400601471607.
[I 2024-01-10 13:00:14,948] Trial 47 finishe

In [14]:
trial = study.best_trial
model = LogisticRegression(**trial.params)
model.fit(X_train_e, y_train_e)
preds = model.predict(X_test_e)
print(classification_report(y_test_e, preds))

              precision    recall  f1-score   support

           0       0.92      0.93      0.93      1010
           1       0.94      0.93      0.93      1109

    accuracy                           0.93      2119
   macro avg       0.93      0.93      0.93      2119
weighted avg       0.93      0.93      0.93      2119
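
In the cell above, the final model is built from `trial.params`, which holds only the values suggested during the search (`C`, `solver` and, for some solvers, `penalty`, as the trial log shows). The fixed `max_iter=1000` from the objective is not part of it, so the final model falls back to scikit-learn's default iteration limit. A minimal sketch of passing the fixed settings again follows. `best_params` is a hypothetical stand-in for `study.best_trial.params`, and `fixed_params` is an illustrative name.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for study.best_trial.params (suggested values only).
best_params = {"C": 0.5, "solver": "sag"}

# Settings that were fixed inside the objective rather than suggested.
fixed_params = {"max_iter": 1000}

model_default = LogisticRegression(**best_params)
model_tuned = LogisticRegression(**best_params, **fixed_params)

print(model_default.max_iter)  # scikit-learn default
print(model_tuned.max_iter)    # value used during the search
```

The reported metrics come from the notebook's original cell; refitting with `max_iter=1000` may give slightly different numbers.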



### 3.1.2 Data based on bag of words

In [15]:
X_train_cv, X_test_cv, y_train_cv, y_test_cv = train_test_split(X_train_b, y_train_b, test_size=0.2, stratify=y_train_b,
                                                                random_state=42)
study = optuna.create_study(direction='maximize')
study.optimize(objective_logreg, n_trials=50)

[I 2024-01-10 13:00:23,395] A new study created in memory with name: no-name-9a267fb1-60e4-4cb9-b536-6780085ce568
[I 2024-01-10 13:00:24,792] Trial 0 finished with value: 0.9828698733810738 and parameters: {'C': 0.8693431650983111, 'solver': 'sag'}. Best is trial 0 with value: 0.9828698733810738.
[I 2024-01-10 13:00:24,823] Trial 1 finished with value: 0.9850367274149501 and parameters: {'C': 0.9198241404525309, 'solver': 'liblinear', 'penalty': 'l2'}. Best is trial 1 with value: 0.9850367274149501.
[I 2024-01-10 13:00:24,849] Trial 2 finished with value: 0.9849825794442911 and parameters: {'C': 0.4444860169701211, 'solver': 'liblinear', 'penalty': 'l2'}. Best is trial 1 with value: 0.9850367274149501.
[I 2024-01-10 13:00:24,879] Trial 3 finished with value: 0.9848215217643891 and parameters: {'C': 0.3459900981787706, 'solver': 'liblinear', 'penalty': 'l2'}. Best is trial 1 with value: 0.9850367274149501.
[I 2024-01-10 13:00:28,861] Trial 4 finished with value: 0.9625866413746507 and p

[I 2024-01-10 13:15:11,350] Trial 41 finished with value: 0.9851572355136117 and parameters: {'C': 0.5564706684348318, 'solver': 'liblinear', 'penalty': 'l2'}. Best is trial 35 with value: 0.9851975275859853.
[I 2024-01-10 13:15:11,382] Trial 42 finished with value: 0.9851845256582701 and parameters: {'C': 0.5480582690152881, 'solver': 'liblinear', 'penalty': 'l2'}. Best is trial 35 with value: 0.9851975275859853.
[I 2024-01-10 13:15:11,416] Trial 43 finished with value: 0.9849853611794486 and parameters: {'C': 0.4217503525991295, 'solver': 'liblinear', 'penalty': 'l2'}. Best is trial 35 with value: 0.9851975275859853.
[I 2024-01-10 13:15:11,516] Trial 44 finished with value: 0.9848632961899934 and parameters: {'C': 0.5320002471962222, 'solver': 'newton-cg'}. Best is trial 35 with value: 0.9851975275859853.
[I 2024-01-10 13:15:11,552] Trial 45 finished with value: 0.984845878047521 and parameters: {'C': 0.37388745918224586, 'solver': 'liblinear', 'penalty': 'l2'}. Best is trial 35 with

In [16]:
trial = study.best_trial
model = LogisticRegression(**trial.params)
model.fit(X_train_b, y_train_b)
preds = model.predict(X_test_b)
print(classification_report(y_test_b, preds))

              precision    recall  f1-score   support

           0       0.91      0.94      0.92      1010
           1       0.94      0.92      0.93      1109

    accuracy                           0.93      2119
   macro avg       0.93      0.93      0.93      2119
weighted avg       0.93      0.93      0.93      2119



## 3.2 Random forest with hyperparameter tuning

In [17]:
def objective_forest(trial):
    param = {
        'n_estimators': trial.suggest_int('n_estimators', 10, 100),
        'criterion': trial.suggest_categorical('criterion', ['gini', 'entropy', 'log_loss']),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 200),
        'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2', None])
    }
    
    model_cv = RandomForestClassifier(**param, random_state=42)
    model_cv.fit(X_train_cv, y_train_cv)
    
    preds = model_cv.predict_proba(X_test_cv)[:, 1]
    precision, recall, thresholds = precision_recall_curve(y_test_cv, preds)
    pr_auc = auc(recall, precision)
    return pr_auc

### 3.2.1 Data based on word embeddings

In [18]:
X_train_cv, X_test_cv, y_train_cv, y_test_cv = train_test_split(X_train_e, y_train_e, test_size=0.2, stratify=y_train_e,
                                                                random_state=42)
study = optuna.create_study(direction='maximize')
study.optimize(objective_forest, n_trials=50)

[I 2024-01-10 13:19:52,767] A new study created in memory with name: no-name-5419c724-1ba8-4deb-8978-96180755f2c1
[I 2024-01-10 13:19:53,693] Trial 0 finished with value: 0.9779330088730348 and parameters: {'n_estimators': 43, 'criterion': 'log_loss', 'min_samples_split': 29, 'max_features': 'log2'}. Best is trial 0 with value: 0.9779330088730348.
[I 2024-01-10 13:19:55,070] Trial 1 finished with value: 0.9706404952996375 and parameters: {'n_estimators': 87, 'criterion': 'entropy', 'min_samples_split': 150, 'max_features': 'log2'}. Best is trial 0 with value: 0.9779330088730348.
[I 2024-01-10 13:19:56,333] Trial 2 finished with value: 0.9654519942081279 and parameters: {'n_estimators': 93, 'criterion': 'gini', 'min_samples_split': 183, 'max_features': 'log2'}. Best is trial 0 with value: 0.9779330088730348.
[I 2024-01-10 13:19:58,709] Trial 3 finished with value: 0.9762885233716974 and parameters: {'n_estimators': 68, 'criterion': 'entropy', 'min_samples_split': 73, 'max_features': 'sq

[I 2024-01-10 13:23:56,712] Trial 35 finished with value: 0.9790192762309 and parameters: {'n_estimators': 94, 'criterion': 'entropy', 'min_samples_split': 39, 'max_features': 'log2'}. Best is trial 10 with value: 0.9841865491451043.
[I 2024-01-10 13:23:58,733] Trial 36 finished with value: 0.9766993764475018 and parameters: {'n_estimators': 75, 'criterion': 'gini', 'min_samples_split': 50, 'max_features': 'log2'}. Best is trial 10 with value: 0.9841865491451043.
[I 2024-01-10 13:24:40,698] Trial 37 finished with value: 0.9675553568510521 and parameters: {'n_estimators': 83, 'criterion': 'entropy', 'min_samples_split': 161, 'max_features': None}. Best is trial 10 with value: 0.9841865491451043.
[I 2024-01-10 13:24:41,647] Trial 38 finished with value: 0.9711652966583157 and parameters: {'n_estimators': 65, 'criterion': 'gini', 'min_samples_split': 125, 'max_features': 'log2'}. Best is trial 10 with value: 0.9841865491451043.
[I 2024-01-10 13:24:43,842] Trial 39 finished with value: 0.9

In [19]:
trial = study.best_trial
model = RandomForestClassifier(**trial.params, random_state=42)
model.fit(X_train_e, y_train_e)
preds = model.predict(X_test_e)
print(classification_report(y_test_e, preds))

              precision    recall  f1-score   support

           0       0.94      0.93      0.93      1010
           1       0.94      0.94      0.94      1109

    accuracy                           0.94      2119
   macro avg       0.94      0.94      0.94      2119
weighted avg       0.94      0.94      0.94      2119



### 3.2.2 Data based on bag of words

In [20]:
X_train_cv, X_test_cv, y_train_cv, y_test_cv = train_test_split(X_train_b, y_train_b, test_size=0.2, stratify=y_train_b,
                                                                random_state=42)
study = optuna.create_study(direction='maximize')
study.optimize(objective_forest, n_trials=50)

[I 2024-01-10 13:25:22,247] A new study created in memory with name: no-name-a7a9edda-2341-42bb-b421-a0e572d320aa
[I 2024-01-10 13:25:23,794] Trial 0 finished with value: 0.9756986054995905 and parameters: {'n_estimators': 85, 'criterion': 'entropy', 'min_samples_split': 188, 'max_features': 'sqrt'}. Best is trial 0 with value: 0.9756986054995905.
[I 2024-01-10 13:25:26,612] Trial 1 finished with value: 0.9790409175743976 and parameters: {'n_estimators': 95, 'criterion': 'entropy', 'min_samples_split': 40, 'max_features': 'sqrt'}. Best is trial 1 with value: 0.9790409175743976.
[I 2024-01-10 13:25:36,785] Trial 2 finished with value: 0.9635052021547006 and parameters: {'n_estimators': 65, 'criterion': 'log_loss', 'min_samples_split': 177, 'max_features': None}. Best is trial 1 with value: 0.9790409175743976.
[I 2024-01-10 13:25:38,961] Trial 3 finished with value: 0.980395691419845 and parameters: {'n_estimators': 83, 'criterion': 'log_loss', 'min_samples_split': 84, 'max_features': 'l

[I 2024-01-10 13:27:44,608] Trial 35 finished with value: 0.9809937739935392 and parameters: {'n_estimators': 93, 'criterion': 'log_loss', 'min_samples_split': 44, 'max_features': 'log2'}. Best is trial 26 with value: 0.9826630011168641.
[I 2024-01-10 13:27:47,487] Trial 36 finished with value: 0.9839383906187165 and parameters: {'n_estimators': 87, 'criterion': 'entropy', 'min_samples_split': 16, 'max_features': 'log2'}. Best is trial 36 with value: 0.9839383906187165.
[I 2024-01-10 13:28:04,915] Trial 37 finished with value: 0.9706361718090337 and parameters: {'n_estimators': 87, 'criterion': 'entropy', 'min_samples_split': 29, 'max_features': None}. Best is trial 36 with value: 0.9839383906187165.
[I 2024-01-10 13:28:08,741] Trial 38 finished with value: 0.9785159286581282 and parameters: {'n_estimators': 85, 'criterion': 'entropy', 'min_samples_split': 72, 'max_features': 'sqrt'}. Best is trial 36 with value: 0.9839383906187165.
[I 2024-01-10 13:28:11,321] Trial 39 finished with va

In [21]:
trial = study.best_trial
model = RandomForestClassifier(**trial.params, random_state=42)
model.fit(X_train_b, y_train_b)
preds = model.predict(X_test_b)
print(classification_report(y_test_b, preds))

              precision    recall  f1-score   support

           0       0.90      0.94      0.92      1010
           1       0.95      0.90      0.92      1109

    accuracy                           0.92      2119
   macro avg       0.92      0.92      0.92      2119
weighted avg       0.92      0.92      0.92      2119



# 4. PyTorch

In [22]:
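def get_word_embedding(tokens, max_len):
    # Map a token list to exactly max_len vectors: truncate long texts,
    # pad short ones with zero vectors, and use zeros for out-of-vocabulary words.
    result = []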
    for i in range(max_len):
        if i < len(tokens):
            word = tokens[i]
            if word in model_tweets.wv:
                result.append(model_tweets.wv[word])
            else:
                result.append(np.zeros(300))
        else:
            result.append(np.zeros(300))
    return result

In [23]:
features = [get_word_embedding(text, 200) for text in tqdm(sentences)]
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.33, stratify=labels, random_state=4)

X_train = torch.tensor(X_train).float()
y_train = torch.tensor(y_train).float()

X_test = torch.tensor(X_test).float()
y_test = torch.tensor(y_test).float()

X_train.shape

100%|██████████| 6420/6420 [00:01<00:00, 4390.25it/s]


torch.Size([4301, 200, 300])

## 4.1 RNN with an LSTM layer

In [24]:
class Net(nn.Module):
    
    def __init__(self):
        super(Net, self).__init__()
        self.lstm = nn.LSTM(300, 100)
        self.out = nn.Linear(100, 1)
    
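    def forward(self, x):
        # nn.LSTM expects (seq_len, batch, features) by default, hence the transpose.
        # It returns the per-step outputs and the final (hidden, cell) states,
        # named shortterm (h_n) and longterm (c_n) here.
        embeddings, (shortterm, longterm) = self.lstm(x.transpose(0, 1))
        # The prediction is read from the final cell state, shape (1, batch, 100).
        prediction = torch.sigmoid(self.out(longterm))
        return prediction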

In [25]:
def train_one_epoch(X_train, y_train, batch_size=16):
    for i in tqdm(range(0, len(X_train), batch_size)):
        batch_x = X_train[i:i + batch_size]
        batch_y = y_train[i:i + batch_size]
        optimizer.zero_grad()
        output = net(batch_x)
        loss = criterion(output.reshape(-1), batch_y)
        loss.backward()
        optimizer.step()
    print(loss)  # loss of the last batch only, not the epoch average
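
The training function above walks through the data in a fixed order and prints only the loss of the final batch. The sketch below is one possible variant that shuffles the samples each epoch and returns the sample-weighted mean loss. The function name `train_one_epoch_avg` and the toy model and data are assumptions for illustration, not the notebook's code.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def train_one_epoch_avg(model, optimizer, criterion, X, y, batch_size=16):
    """Run one epoch over shuffled mini-batches and return the mean loss."""
    model.train()
    order = torch.randperm(len(X))  # new sample order every epoch
    total_loss = 0.0
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        optimizer.zero_grad()
        output = model(X[idx]).reshape(-1)
        loss = criterion(output, y[idx])
        loss.backward()
        optimizer.step()
        # Weight by batch size so a short last batch does not skew the mean.
        total_loss += loss.item() * len(idx)
    return total_loss / len(X)

# Toy, linearly separable data (illustrative only).
torch.manual_seed(0)
toy_X = torch.randn(64, 5)
toy_y = (toy_X[:, 0] > 0).float()

toy_model = nn.Sequential(nn.Linear(5, 1), nn.Sigmoid())
toy_optimizer = optim.Adam(toy_model.parameters(), lr=0.05)
toy_criterion = nn.BCELoss()

epoch_losses = [
    train_one_epoch_avg(toy_model, toy_optimizer, toy_criterion, toy_X, toy_y)
    for _ in range(5)
]
print(epoch_losses)
```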

In [26]:
net = Net()
print(net)
optimizer = optim.Adam(net.parameters(), lr=0.003)
criterion = nn.BCELoss()

best_net = None
best_epoch = None
best_score = 0

for epoch in range(10):
    train_one_epoch(X_train, y_train, batch_size=16)
    
    with torch.no_grad():
        output = net(X_test).reshape(-1)
    preds = (output > 0.5).numpy().astype(int)
    score = accuracy_score(y_test, preds)
    print(f'Epoch {epoch} accuracy score:', score)
    
    if score > best_score:
        best_net = deepcopy(net)
        best_epoch = epoch
        best_score = score
    print(f'Best epoch {best_epoch}, Best score:', best_score)

Net(
  (lstm): LSTM(300, 100)
  (out): Linear(in_features=100, out_features=1, bias=True)
)


100%|██████████| 269/269 [00:05<00:00, 47.51it/s]


tensor(0.6829, grad_fn=<BinaryCrossEntropyBackward0>)
Epoch 0 accuracy score: 0.5243039169419538
Best epoch 0, Best score: 0.5243039169419538


100%|██████████| 269/269 [00:05<00:00, 51.24it/s]


tensor(0.6808, grad_fn=<BinaryCrossEntropyBackward0>)
Epoch 1 accuracy score: 0.5243039169419538
Best epoch 0, Best score: 0.5243039169419538


100%|██████████| 269/269 [00:06<00:00, 40.73it/s]


tensor(0.6796, grad_fn=<BinaryCrossEntropyBackward0>)
Epoch 2 accuracy score: 0.5243039169419538
Best epoch 0, Best score: 0.5243039169419538


100%|██████████| 269/269 [00:06<00:00, 41.10it/s]


tensor(0.6791, grad_fn=<BinaryCrossEntropyBackward0>)
Epoch 3 accuracy score: 0.5243039169419538
Best epoch 0, Best score: 0.5243039169419538


100%|██████████| 269/269 [00:05<00:00, 51.59it/s]


tensor(0.6786, grad_fn=<BinaryCrossEntropyBackward0>)
Epoch 4 accuracy score: 0.5243039169419538
Best epoch 0, Best score: 0.5243039169419538


100%|██████████| 269/269 [00:05<00:00, 47.69it/s]


tensor(0.6783, grad_fn=<BinaryCrossEntropyBackward0>)
Epoch 5 accuracy score: 0.5243039169419538
Best epoch 0, Best score: 0.5243039169419538


100%|██████████| 269/269 [00:09<00:00, 27.14it/s]


tensor(0.6784, grad_fn=<BinaryCrossEntropyBackward0>)
Epoch 6 accuracy score: 0.5243039169419538
Best epoch 0, Best score: 0.5243039169419538


100%|██████████| 269/269 [00:06<00:00, 40.54it/s]


tensor(0.6780, grad_fn=<BinaryCrossEntropyBackward0>)
Epoch 7 accuracy score: 0.5243039169419538
Best epoch 0, Best score: 0.5243039169419538


100%|██████████| 269/269 [00:05<00:00, 51.36it/s]


tensor(0.6776, grad_fn=<BinaryCrossEntropyBackward0>)
Epoch 8 accuracy score: 0.5243039169419538
Best epoch 0, Best score: 0.5243039169419538


100%|██████████| 269/269 [00:05<00:00, 50.77it/s]


tensor(0.7009, grad_fn=<BinaryCrossEntropyBackward0>)
Epoch 9 accuracy score: 0.4756960830580462
Best epoch 0, Best score: 0.5243039169419538


In [27]:
with torch.no_grad():
    output = best_net(X_test).reshape(-1)
preds = (output > 0.5).numpy().astype(int)
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

         0.0       1.00      0.00      0.00      1010
         1.0       0.52      1.00      0.69      1109

    accuracy                           0.52      2119
   macro avg       0.76      0.50      0.35      2119
weighted avg       0.75      0.52      0.36      2119
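
The report above shows the network assigning class 1 to essentially every test tweet. One possible contributor, offered as an assumption rather than a diagnosis: every tweet is padded with zero vectors up to 200 steps, and the prediction is read from the state after the last padded step, so the LSTM has to carry information through a long run of padding. The sketch below reads the LSTM output at the last real token instead. The class name `LastTokenLSTM` and the `lengths` tensor are hypothetical, and the model returns logits (for use with `nn.BCEWithLogitsLoss`).

```python
import torch
import torch.nn as nn

class LastTokenLSTM(nn.Module):
    """LSTM classifier that reads the output at the last non-padded step."""

    def __init__(self, input_size=300, hidden_size=100):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, x, lengths):
        outputs, _ = self.lstm(x)            # (batch, seq_len, hidden)
        idx = (lengths - 1).clamp(min=0)     # index of the last real token
        last = outputs[torch.arange(x.size(0)), idx]  # (batch, hidden)
        return self.out(last).squeeze(1)     # raw logits, shape (batch,)

# Toy batch: 4 sequences padded with zeros to 10 steps (illustrative only).
torch.manual_seed(0)
lengths = torch.tensor([3, 10, 1, 7])
x = torch.zeros(4, 10, 300)
for i, n in enumerate(lengths):
    x[i, :int(n)] = torch.randn(int(n), 300)

model = LastTokenLSTM()
with torch.no_grad():
    logits = model(x, lengths)
print(logits.shape)
```

In the notebook's setting, `lengths` could be derived as `min(len(tokens), 200)` for each tweet; that derivation is also an assumption about how the sketch would be wired in.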



## 4.2 RNN with a bidirectional LSTM layer and dropout

In [28]:
class Net(nn.Module):
    
    def __init__(self):
        super(Net, self).__init__()
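        # With a single LSTM layer, dropout has no effect: PyTorch applies it only between stacked layers.
        self.lstm = nn.LSTM(300, 100, dropout=0.5, bidirectional=True)
        self.out = nn.Linear(100, 1)
    
    def forward(self, x):
        embeddings, (shortterm, longterm) = self.lstm(x.transpose(0, 1))
        # longterm has shape (2, batch, 100), one final cell state per direction,
        # so the network returns two predictions per sample.
        prediction = torch.sigmoid(self.out(longterm))
        return prediction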

In [29]:
def train_one_epoch(X_train, y_train, batch_size=16):
    for i in tqdm(range(0, len(X_train), batch_size)):
        batch_x = X_train[i:i + batch_size]
        batch_y = y_train[i:i + batch_size]
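        # The network returns 2 * batch outputs (forward direction, then backward direction,
        # both in the same sample order), so the labels are duplicated here.
        # Note that the flipped copy reverses the sample order of the second half.
        batch_y = torch.concat((batch_y, torch.flip(batch_y, [0])))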
        optimizer.zero_grad()
        output = net(batch_x)
        loss = criterion(output.reshape(-1), batch_y)
        loss.backward()
        optimizer.step()
    print(loss)  # loss of the last batch only, not the epoch average
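
With `bidirectional=True`, the final states have shape `(2, batch, hidden)`, and both directions index the samples in the same batch order. Pairing the second half of the outputs with flipped labels therefore matches most backward-direction outputs with labels of other samples. A more common design, sketched below under stated assumptions, concatenates the two final hidden states into one feature vector per sample, so that the network produces one output per sample and no label duplication is needed. The class name `BiLSTMClassifier` is hypothetical, and the model returns logits.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Bidirectional LSTM that emits one logit per sample."""

    def __init__(self, input_size=300, hidden_size=100):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size,
                            bidirectional=True, batch_first=True)
        # Forward and backward states are concatenated, hence 2 * hidden_size.
        self.out = nn.Linear(2 * hidden_size, 1)

    def forward(self, x):
        _, (h_n, _) = self.lstm(x)                     # h_n: (2, batch, hidden)
        features = torch.cat((h_n[0], h_n[1]), dim=1)  # (batch, 2 * hidden)
        return self.out(features).squeeze(1)           # (batch,)

# Toy batch: 8 sequences of 20 steps (illustrative only).
torch.manual_seed(0)
batch = torch.randn(8, 20, 300)
model = BiLSTMClassifier()
with torch.no_grad():
    _, (h_n, c_n) = model.lstm(batch)
    logits = model(batch)

print(h_n.shape)     # final hidden states, one per direction
print(logits.shape)  # one logit per sample
```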

In [30]:
net = Net()
print(net)
optimizer = optim.Adam(net.parameters(), lr=0.003)
criterion = nn.BCELoss()

best_net = None
best_epoch = None
best_score = 0

for epoch in range(10):
    train_one_epoch(X_train, y_train, batch_size=16)
    
    with torch.no_grad():
        output = net(X_test).reshape(-1)
    preds = (output > 0.5).numpy().astype(int)[len(y_test):]
    score = accuracy_score(y_test, preds)
    print(f'Epoch {epoch} accuracy score:', score)
    
    if score > best_score:
        best_net = deepcopy(net)
        best_epoch = epoch
        best_score = score
    print(f'Best epoch {best_epoch}, Best score:', best_score)

Net(
  (lstm): LSTM(300, 100, dropout=0.5, bidirectional=True)
  (out): Linear(in_features=100, out_features=1, bias=True)
)


100%|██████████| 269/269 [00:17<00:00, 15.79it/s]


tensor(0.6792, grad_fn=<BinaryCrossEntropyBackward0>)
Epoch 0 accuracy score: 0.6857008022652195
Best epoch 0, Best score: 0.6857008022652195


100%|██████████| 269/269 [00:10<00:00, 25.90it/s]


tensor(0.6583, grad_fn=<BinaryCrossEntropyBackward0>)
Epoch 1 accuracy score: 0.6781500707881076
Best epoch 0, Best score: 0.6857008022652195


100%|██████████| 269/269 [00:19<00:00, 13.89it/s]


tensor(0.6443, grad_fn=<BinaryCrossEntropyBackward0>)
Epoch 2 accuracy score: 0.6319018404907976
Best epoch 0, Best score: 0.6857008022652195


100%|██████████| 269/269 [00:10<00:00, 25.62it/s]


tensor(0.6050, grad_fn=<BinaryCrossEntropyBackward0>)
Epoch 3 accuracy score: 0.6238791882963662
Best epoch 0, Best score: 0.6857008022652195


100%|██████████| 269/269 [00:17<00:00, 15.07it/s]


tensor(0.5894, grad_fn=<BinaryCrossEntropyBackward0>)
Epoch 4 accuracy score: 0.6219915054270883
Best epoch 0, Best score: 0.6857008022652195


100%|██████████| 269/269 [00:10<00:00, 26.07it/s]


tensor(0.5739, grad_fn=<BinaryCrossEntropyBackward0>)
Epoch 5 accuracy score: 0.5946201038225578
Best epoch 0, Best score: 0.6857008022652195


100%|██████████| 269/269 [00:21<00:00, 12.66it/s]


tensor(0.5348, grad_fn=<BinaryCrossEntropyBackward0>)
Epoch 6 accuracy score: 0.5663048607833884
Best epoch 0, Best score: 0.6857008022652195


100%|██████████| 269/269 [00:10<00:00, 25.66it/s]


tensor(0.5090, grad_fn=<BinaryCrossEntropyBackward0>)
Epoch 7 accuracy score: 0.5516753185464842
Best epoch 0, Best score: 0.6857008022652195


100%|██████████| 269/269 [00:17<00:00, 15.73it/s]


tensor(0.4485, grad_fn=<BinaryCrossEntropyBackward0>)
Epoch 8 accuracy score: 0.5507314771118452
Best epoch 0, Best score: 0.6857008022652195


100%|██████████| 269/269 [00:10<00:00, 26.15it/s]


tensor(0.4184, grad_fn=<BinaryCrossEntropyBackward0>)
Epoch 9 accuracy score: 0.5412930627654554
Best epoch 0, Best score: 0.6857008022652195


In [31]:
with torch.no_grad():
    output = best_net(X_test).reshape(-1)
preds = (output > 0.5).numpy().astype(int)[len(y_test):]
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

         0.0       0.66      0.70      0.68      1010
         1.0       0.71      0.67      0.69      1109

    accuracy                           0.69      2119
   macro avg       0.69      0.69      0.69      2119
weighted avg       0.69      0.69      0.69      2119



## 4.3 Fully connected network trained on aggregated (summed) text embeddings

In [32]:
X_train = torch.tensor(X_train_e).float()
y_train = torch.tensor(y_train_e).float()

X_test = torch.tensor(X_test_e).float()
y_test = torch.tensor(y_test_e).float()

X_train.shape

torch.Size([4301, 300])

In [33]:
class Net(nn.Module):
    
    def __init__(self):
        super(Net, self).__init__()
        self.out = nn.Linear(300, 1)
    
    def forward(self, x):
        return torch.sigmoid(self.out(x))
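
This `forward` already applies a sigmoid, while the training cell in this section uses `nn.BCEWithLogitsLoss`, which applies a sigmoid internally as well. The loss then sees values squeezed into the range (0.5, 0.73) as probabilities, which bounds it from below and may explain why the logged loss stops decreasing. A minimal sketch of the two consistent pairings and the mismatched one follows; the toy logits and targets are made up for illustration.

```python
import torch
import torch.nn as nn

# Toy raw scores (logits) and binary targets.
logits = torch.tensor([-3.0, 0.5, 4.0])
targets = torch.tensor([0.0, 1.0, 1.0])

# Consistent pairing 1: raw logits with BCEWithLogitsLoss.
loss_logits = nn.BCEWithLogitsLoss()(logits, targets)

# Consistent pairing 2: sigmoid outputs with BCELoss.
loss_probs = nn.BCELoss()(torch.sigmoid(logits), targets)

# Mismatch: sigmoid outputs fed to BCEWithLogitsLoss (sigmoid applied twice).
loss_double = nn.BCEWithLogitsLoss()(torch.sigmoid(logits), targets)

print(loss_logits.item(), loss_probs.item(), loss_double.item())
```

The first two values agree up to floating-point error; the third is larger even though the underlying predictions are the same.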

In [34]:
def train_one_epoch(X_train, y_train, batch_size=16):
    for i in tqdm(range(0, len(X_train), batch_size)):
        batch_x = X_train[i:i + batch_size]
        batch_y = y_train[i:i + batch_size]
        optimizer.zero_grad()
        output = net(batch_x)
        loss = criterion(output.reshape(-1), batch_y)
        loss.backward()
        optimizer.step()
    print(loss)  # loss of the last batch only, not the epoch average

In [35]:
net = Net()
print(net)
optimizer = optim.Adam(net.parameters(), lr=0.003)
criterion = nn.BCEWithLogitsLoss()

best_net = None
best_epoch = None
best_score = 0

for epoch in range(50):
    train_one_epoch(X_train, y_train, batch_size=16)
    
    with torch.no_grad():
        output = net(X_test).reshape(-1)
    preds = (output > 0.5).numpy().astype(int)
    score = accuracy_score(y_test, preds)
    print(f'Epoch {epoch} accuracy score:', score)
    
    if score > best_score:
        best_net = deepcopy(net)
        best_epoch = epoch
        best_score = score
    print(f'Best epoch {best_epoch}, Best score:', best_score)

Net(
  (out): Linear(in_features=300, out_features=1, bias=True)
)


100%|██████████| 269/269 [00:00<00:00, 636.53it/s]


tensor(0.5189, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 0 accuracy score: 0.89806512505899
Best epoch 0, Best score: 0.89806512505899


100%|██████████| 269/269 [00:00<00:00, 707.90it/s]


tensor(0.5133, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 1 accuracy score: 0.8994808872109485
Best epoch 1, Best score: 0.8994808872109485


100%|██████████| 269/269 [00:00<00:00, 707.75it/s]


tensor(0.4913, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 2 accuracy score: 0.8961774421897122
Best epoch 1, Best score: 0.8994808872109485


100%|██████████| 269/269 [00:00<00:00, 704.62it/s]


tensor(0.4920, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 3 accuracy score: 0.9075035394053799
Best epoch 3, Best score: 0.9075035394053799


100%|██████████| 269/269 [00:00<00:00, 632.91it/s]


tensor(0.4890, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 4 accuracy score: 0.9098631429919773
Best epoch 4, Best score: 0.9098631429919773


100%|██████████| 269/269 [00:00<00:00, 356.61it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 5 accuracy score: 0.9089193015573384
Best epoch 4, Best score: 0.9098631429919773


100%|██████████| 269/269 [00:00<00:00, 355.50it/s]


tensor(0.4890, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 6 accuracy score: 0.9145823501651722
Best epoch 6, Best score: 0.9145823501651722


100%|██████████| 269/269 [00:00<00:00, 353.15it/s]


tensor(0.4887, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 7 accuracy score: 0.9159981123171307
Best epoch 7, Best score: 0.9159981123171307


100%|██████████| 269/269 [00:00<00:00, 407.87it/s]


tensor(0.4889, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 8 accuracy score: 0.9169419537517697
Best epoch 8, Best score: 0.9169419537517697


100%|██████████| 269/269 [00:00<00:00, 783.54it/s]


tensor(0.4887, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 9 accuracy score: 0.9065596979707409
Best epoch 8, Best score: 0.9169419537517697


100%|██████████| 269/269 [00:00<00:00, 777.55it/s]


tensor(0.4889, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 10 accuracy score: 0.9093912222746579
Best epoch 8, Best score: 0.9169419537517697


100%|██████████| 269/269 [00:00<00:00, 776.58it/s]


tensor(0.4891, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 11 accuracy score: 0.9122227465785748
Best epoch 8, Best score: 0.9169419537517697


100%|██████████| 269/269 [00:00<00:00, 773.27it/s]


tensor(0.4922, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 12 accuracy score: 0.9145823501651722
Best epoch 8, Best score: 0.9169419537517697


100%|██████████| 269/269 [00:00<00:00, 783.95it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 13 accuracy score: 0.9103350637092968
Best epoch 8, Best score: 0.9169419537517697


100%|██████████| 269/269 [00:00<00:00, 748.43it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 14 accuracy score: 0.9164700330344502
Best epoch 8, Best score: 0.9169419537517697


100%|██████████| 269/269 [00:00<00:00, 765.84it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 15 accuracy score: 0.9193015573383672
Best epoch 15, Best score: 0.9193015573383672


100%|██████████| 269/269 [00:00<00:00, 778.85it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 16 accuracy score: 0.9098631429919773
Best epoch 15, Best score: 0.9193015573383672


100%|██████████| 269/269 [00:00<00:00, 864.00it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 17 accuracy score: 0.9141104294478528
Best epoch 15, Best score: 0.9193015573383672


100%|██████████| 269/269 [00:00<00:00, 878.29it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 18 accuracy score: 0.9159981123171307
Best epoch 15, Best score: 0.9193015573383672


100%|██████████| 269/269 [00:00<00:00, 880.93it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 19 accuracy score: 0.9174138744690892
Best epoch 15, Best score: 0.9193015573383672


100%|██████████| 269/269 [00:00<00:00, 881.85it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 20 accuracy score: 0.9145823501651722
Best epoch 15, Best score: 0.9193015573383672


100%|██████████| 269/269 [00:00<00:00, 893.03it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 21 accuracy score: 0.9164700330344502
Best epoch 15, Best score: 0.9193015573383672


100%|██████████| 269/269 [00:00<00:00, 895.30it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 22 accuracy score: 0.9122227465785748
Best epoch 15, Best score: 0.9193015573383672


100%|██████████| 269/269 [00:00<00:00, 862.76it/s]


tensor(0.4891, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 23 accuracy score: 0.9164700330344502
Best epoch 15, Best score: 0.9193015573383672


100%|██████████| 269/269 [00:00<00:00, 900.40it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 24 accuracy score: 0.9193015573383672
Best epoch 15, Best score: 0.9193015573383672


100%|██████████| 269/269 [00:00<00:00, 729.99it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 25 accuracy score: 0.9136385087305333
Best epoch 15, Best score: 0.9193015573383672


100%|██████████| 269/269 [00:00<00:00, 732.33it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 26 accuracy score: 0.9098631429919773
Best epoch 15, Best score: 0.9193015573383672


100%|██████████| 269/269 [00:00<00:00, 985.91it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 27 accuracy score: 0.9174138744690892
Best epoch 15, Best score: 0.9193015573383672


100%|██████████| 269/269 [00:00<00:00, 1000.37it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 28 accuracy score: 0.9117508258612553
Best epoch 15, Best score: 0.9193015573383672


100%|██████████| 269/269 [00:00<00:00, 1000.91it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 29 accuracy score: 0.9193015573383672
Best epoch 15, Best score: 0.9193015573383672


100%|██████████| 269/269 [00:00<00:00, 1008.01it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 30 accuracy score: 0.9183577159037282
Best epoch 15, Best score: 0.9193015573383672


100%|██████████| 269/269 [00:00<00:00, 999.20it/s] 


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 31 accuracy score: 0.9159981123171307
Best epoch 15, Best score: 0.9193015573383672


100%|██████████| 269/269 [00:00<00:00, 1003.25it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 32 accuracy score: 0.9159981123171307
Best epoch 15, Best score: 0.9193015573383672


100%|██████████| 269/269 [00:00<00:00, 1004.04it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 33 accuracy score: 0.9178857951864087
Best epoch 15, Best score: 0.9193015573383672


100%|██████████| 269/269 [00:00<00:00, 999.74it/s] 


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 34 accuracy score: 0.9197734780556867
Best epoch 34, Best score: 0.9197734780556867


100%|██████████| 269/269 [00:00<00:00, 999.34it/s] 


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 35 accuracy score: 0.9122227465785748
Best epoch 34, Best score: 0.9197734780556867


100%|██████████| 269/269 [00:00<00:00, 993.89it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 36 accuracy score: 0.9131665880132138
Best epoch 34, Best score: 0.9197734780556867


100%|██████████| 269/269 [00:00<00:00, 1282.59it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 37 accuracy score: 0.9197734780556867
Best epoch 34, Best score: 0.9197734780556867


100%|██████████| 269/269 [00:00<00:00, 1445.39it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 38 accuracy score: 0.9188296366210477
Best epoch 34, Best score: 0.9197734780556867


100%|██████████| 269/269 [00:00<00:00, 1363.49it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 39 accuracy score: 0.9221330816422841
Best epoch 39, Best score: 0.9221330816422841


100%|██████████| 269/269 [00:00<00:00, 1446.97it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 40 accuracy score: 0.9221330816422841
Best epoch 39, Best score: 0.9221330816422841


100%|██████████| 269/269 [00:00<00:00, 1442.84it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 41 accuracy score: 0.9226050023596036
Best epoch 41, Best score: 0.9226050023596036


100%|██████████| 269/269 [00:00<00:00, 1439.83it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 42 accuracy score: 0.9193015573383672
Best epoch 41, Best score: 0.9226050023596036


100%|██████████| 269/269 [00:00<00:00, 1429.98it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 43 accuracy score: 0.9197734780556867
Best epoch 41, Best score: 0.9226050023596036


100%|██████████| 269/269 [00:00<00:00, 1435.94it/s]


tensor(0.4888, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 44 accuracy score: 0.9183577159037282
Best epoch 41, Best score: 0.9226050023596036


100%|██████████| 269/269 [00:00<00:00, 1433.29it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 45 accuracy score: 0.9221330816422841
Best epoch 41, Best score: 0.9226050023596036


100%|██████████| 269/269 [00:00<00:00, 1444.28it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 46 accuracy score: 0.9188296366210477
Best epoch 41, Best score: 0.9226050023596036


100%|██████████| 269/269 [00:00<00:00, 1428.13it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 47 accuracy score: 0.9202453987730062
Best epoch 41, Best score: 0.9226050023596036


100%|██████████| 269/269 [00:00<00:00, 1424.45it/s]


tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 48 accuracy score: 0.924020764511562
Best epoch 48, Best score: 0.924020764511562


100%|██████████| 269/269 [00:00<00:00, 1436.67it/s]

tensor(0.4886, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
Epoch 49 accuracy score: 0.9183577159037282
Best epoch 48, Best score: 0.924020764511562
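
The value printed after each epoch stays at 0.4886 almost throughout, even though accuracy keeps changing. One possible explanation, offered as an assumption because the model definition is not shown in this part of the notebook: if the network already ends with a sigmoid, `BCEWithLogitsLoss` receives values in [0, 1] instead of raw logits, and even a perfectly classified batch cannot push the loss below a floor. The sketch below reproduces the number in plain Python. The batch composition (13 samples, 7 of them positive) is hypothetical, chosen only because it matches the logged value.

```python
import math

def bce_with_logits(x: float, y: float) -> float:
    """Numerically stable binary cross-entropy of a raw score x against target y."""
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))

# Hypothetical batch: outputs saturated at exactly 0 or 1, all of them correct.
targets = [1.0] * 7 + [0.0] * 6
outputs = list(targets)  # probabilities passed to the loss as if they were logits

# Mean over the batch, mirroring the default "mean" reduction.
floor_loss = sum(bce_with_logits(x, y) for x, y in zip(outputs, targets)) / len(targets)
print(round(floor_loss, 4))  # 0.4886
```

If this is what happens here, the printed loss says little about training progress, and accuracy on the test split is the more useful signal. Pairing a sigmoid output with `nn.BCELoss`, or raw logits with `nn.BCEWithLogitsLoss`, would make the loss informative again.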

In [36]:
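# Evaluate the best checkpoint on the held-out test set; gradients are not needed for inference.
with torch.no_grad():
    output = best_net(X_test).reshape(-1)
# The 0.5 cutoff treats the outputs as probabilities; for raw logits the matching cutoff would be 0.
preds = (output > 0.5).numpy().astype(int)
print(classification_report(y_test, preds))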

              precision    recall  f1-score   support

         0.0       0.91      0.94      0.92      1010
         1.0       0.94      0.91      0.93      1109

    accuracy                           0.92      2119
   macro avg       0.92      0.92      0.92      2119
weighted avg       0.92      0.92      0.92      2119
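
To make the columns of the report concrete, here is a minimal sketch that recomputes precision, recall, F1 and support per class, plus the macro and weighted averages, from two label lists. The helper name `per_class_metrics` and the toy labels are illustrative assumptions and are not taken from the notebook.

```python
def per_class_metrics(y_true, y_pred, cls):
    """Return (precision, recall, f1, support) for one class label."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn  # support = number of true samples of cls

# Toy labels, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

rows = {cls: per_class_metrics(y_true, y_pred, cls) for cls in (0, 1)}
total = len(y_true)

# Macro average: plain mean over classes. Weighted average: mean weighted by support.
macro_precision = sum(r[0] for r in rows.values()) / len(rows)
weighted_precision = sum(r[0] * r[3] for r in rows.values()) / total
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total

for cls, (p, r, f1, support) in rows.items():
    print(f"{cls}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={support}")
```

In the report above the two classes have similar support (1010 and 1109), which is why the macro and weighted averages come out nearly identical.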

