Background reading on PyTorch:
* [Tensors in PyTorch](https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/tensor_tutorial.ipynb)
* [Automatic differentiation and what .backward() does](https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/autograd_tutorial.ipynb)
* [A very simple neural network in PyTorch](https://colab.research.google.com/drive/1RsZvw4KBGn5U5Aj5Ak7OG2pHx6z1OSlF)

# Text classification

## Fake news

1. We will work with the fake-news data from here: https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv
2. Preprocess the text. Split the data into train and test sets for the classification task.
3. Vectorize the texts.
4. Train a classification algorithm on the resulting vectors.

We have already seen how to solve this task with Word2Vec. Let's recap.

In [1]:
!wget https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv

--2024-05-15 16:28:31--  https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1253562 (1,2M) [text/plain]
Saving to: ‘Constraint_Train.csv.1’


2024-05-15 16:28:32 (5,94 MB/s) - ‘Constraint_Train.csv.1’ saved [1253562/1253562]



In [30]:
import pandas as pd
from nltk.tokenize import word_tokenize
from tqdm import tqdm
import nltk
nltk.download('punkt')
import numpy as np

[nltk_data] Downloading package punkt to /home/vyacheslav/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
df = pd.read_csv('Constraint_Train.csv')

In [4]:
df.head()

   id                                              tweet label
0   1  The CDC currently reports 99031 deaths. In gen...  real
1   2  States reported 1121 deaths a small rise from ...  real
2   3  Politically Correct Woman (Almost) Uses Pandem...  fake
3   4  #IndiaFightsCorona: We have 1524 #COVID testin...  real
4   5  Populous states can generate large case counts...  real


In [7]:
sentences = [word_tokenize(text.lower()) for text in tqdm(df.tweet)]

100%|██████████| 6420/6420 [00:00<00:00, 8009.03it/s]


In [12]:
from gensim.models.word2vec import Word2Vec
%time model_tweets = Word2Vec(sentences, workers=4, vector_size=300, min_count=3, window=5, epochs=15)

CPU times: user 3.45 s, sys: 0 ns, total: 3.45 s
Wall time: 1.06 s


In [13]:
model_tweets.wv.most_similar('france')

[('front', 0.918729305267334),
 ('bags', 0.9134325385093689),
 ('road', 0.9080644845962524),
 ('nairobi', 0.9047010540962219),
 ('singing', 0.9011433124542236),
 ('parliament', 0.9003052711486816),
 ('mall', 0.8977693319320679),
 ('pondicherry', 0.8972685933113098),
 ('student', 0.8953002691268921),
 ('tower', 0.8932645916938782)]
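
`most_similar` ranks words by the cosine similarity between their vectors. A minimal NumPy sketch of that measure follows; the three vectors are made up for illustration and the helper name `cosine_similarity` is hypothetical, not part of gensim.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (hypothetical helper)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors, not real embeddings
v1 = np.array([1.0, 0.0, 0.0])
v2 = np.array([2.0, 0.0, 0.0])  # same direction as v1, different length
v3 = np.array([0.0, 1.0, 0.0])  # orthogonal to v1

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

Only the direction of the vectors matters, not their length. With a corpus of just 6420 tweets the nearest neighbours can be noisy, as the list for 'france' suggests.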

In [14]:
model_tweets.init_sims()

  model_tweets.init_sims()


In [16]:
def get_text_embedding(text):
    result = []
    for word in word_tokenize(text.lower()):
        if word in model_tweets.wv:
            result.append(model_tweets.wv[word])

    if len(result):
        result = np.sum(result, axis=0)
    else:
        result = np.zeros(300)
    return result
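
`get_text_embedding` sums the word vectors, so a longer tweet tends to produce a vector with a larger norm. Averaging is a common alternative that removes this length effect. The sketch below compares the two on made-up 2-dimensional vectors; `toy_vectors` and `pool` are hypothetical names, not part of the notebook.

```python
import numpy as np

# Hypothetical 2-dimensional "embeddings", for illustration only
toy_vectors = {
    "covid": np.array([1.0, 0.0]),
    "cases": np.array([0.0, 1.0]),
}

def pool(tokens, how="sum", dim=2):
    """Combine the vectors of known tokens by summing or averaging."""
    known = [toy_vectors[t] for t in tokens if t in toy_vectors]
    if not known:
        return np.zeros(dim)  # no known words: fall back to a zero vector
    stacked = np.stack(known)
    return stacked.sum(axis=0) if how == "sum" else stacked.mean(axis=0)

short = ["covid", "cases"]
long = short * 3  # the same content repeated three times

print(pool(short, "sum"), pool(long, "sum"))    # the sum grows with length
print(pool(short, "mean"), pool(long, "mean"))  # the mean does not
```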

In [17]:
features = [get_text_embedding(text) for text in tqdm(df.tweet)]

100%|██████████| 6420/6420 [00:01<00:00, 6070.05it/s]


In [34]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

Training a model on the embeddings

In [19]:
X_train, X_test, y_train, y_test = train_test_split(features, df.label, test_size=0.33)

In [31]:
model = LogisticRegression()
model.fit(X_train, y_train)

predicted = model.predict(X_test)

print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

        fake       0.89      0.92      0.90      1000
        real       0.92      0.90      0.91      1119

    accuracy                           0.91      2119
   macro avg       0.91      0.91      0.91      2119
weighted avg       0.91      0.91      0.91      2119



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
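
The warning above means the solver stopped at its iteration limit before converging. It names two remedies: raise `max_iter` or scale the features. The sketch below applies both to synthetic data from `make_classification`, which stands in for the tweet features; the dataset and the scale factor are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix, deliberately on a large scale
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X = X * 1000.0

# Remedy 1: standardize the features; remedy 2: allow more iterations
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)

n_iter = int(clf[-1].n_iter_[0])
print(f"iterations used: {n_iter} of 1000")
print(f"training accuracy: {clf.score(X, y):.3f}")
```

Putting the scaler inside a pipeline means it is fitted only on the data passed to `fit`, so the same object can later be applied to a held-out test set.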


### What happens if we use the most naive method?

In [32]:
from sklearn.feature_extraction.text import CountVectorizer

# Split the raw texts; X_train from the previous cells holds Word2Vec features
x_train, x_test, y_train, y_test = train_test_split(df.tweet, df.label, test_size=0.33)

# Fit the vocabulary on the training texts only, then reuse it for the test texts
vec = CountVectorizer()
bow = vec.fit_transform(x_train)
model = LogisticRegression(max_iter=1000)
model.fit(bow, y_train)

predicted = model.predict(vec.transform(x_test))
print(classification_report(y_test, predicted))


Now let's train a model on TF-IDF features.

In [38]:
x_train, x_test, y_train, y_test = train_test_split(df.tweet, df.label)

vec = TfidfVectorizer()
bow = vec.fit_transform(x_train)
clf = LogisticRegression(random_state=42, solver='liblinear')
clf.fit(bow, y_train)
pred = clf.predict(vec.transform(x_test))
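# Note: classification_report expects (y_true, y_pred); the arguments are swapped here,
# so the precision and recall columns in the report below are interchanged
print(classification_report(pred, y_test))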

              precision    recall  f1-score   support

        fake       0.90      0.94      0.92       755
        real       0.94      0.91      0.93       850

    accuracy                           0.92      1605
   macro avg       0.92      0.93      0.92      1605
weighted avg       0.93      0.92      0.92      1605
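
TF-IDF reweights raw counts so that a word occurring in many documents contributes less than a word occurring in few. The sketch below shows this on a made-up three-document corpus, using the default settings of `CountVectorizer` and `TfidfVectorizer`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up mini corpus: "covid" occurs in every document, the other words in one each
docs = ["covid cases rise", "covid vaccine trial", "covid hoax claim"]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()
count_row = count_vec.fit_transform(docs).toarray()[0]
tfidf_row = tfidf_vec.fit_transform(docs).toarray()[0]

# Both vectorizers build the same alphabetical vocabulary here
for word, idx in sorted(tfidf_vec.vocabulary_.items(), key=lambda kv: kv[1]):
    if count_row[idx] > 0:
        print(f"{word}: count={count_row[idx]}, tf-idf={tfidf_row[idx]:.3f}")
```

All three words of the first document have a count of 1, but "covid" receives a lower TF-IDF weight than "cases" and "rise" because it appears in every document.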



### PyTorch + LSTM

In [39]:
labels = (df.label == 'real').astype(int).to_list()

We need to fix the maximum sentence length in advance.

In [40]:
token_lists = [word_tokenize(text.lower()) for text in df.tweet]
max_len = len(max(token_lists, key=len))

In [41]:
max_len

1592

That is far too long. What is the typical length?

In [42]:
from collections import Counter
fd = Counter([len(tokens) for tokens in token_lists])

In [43]:
fd.most_common(10)

[(20, 178),
 (25, 174),
 (22, 170),
 (18, 170),
 (19, 168),
 (21, 168),
 (16, 163),
 (17, 162),
 (15, 160),
 (23, 156)]

Let's cap the length at 200 tokens.
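
One way to sanity-check such a cap is to measure how many texts fit under it. The sketch below does this on made-up token counts; in the notebook the same check would presumably run on `[len(tokens) for tokens in token_lists]`.

```python
import numpy as np

# Made-up token counts; in the notebook these would be
# [len(tokens) for tokens in token_lists]
toy_lengths = np.array([15, 18, 20, 22, 25, 31, 40, 57, 120, 1592])

cap = 200
share_fitting = float(np.mean(toy_lengths <= cap))
median_length = float(np.median(toy_lengths))

print(f"median length: {median_length}")
print(f"share of texts with at most {cap} tokens: {share_fitting:.0%}")
```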

We will reuse the same Word2Vec embeddings.

In [44]:
def get_word_embedding(tokens, max_len):
    result = []
    for i in range(max_len):
        if i < len(tokens):
            word = tokens[i]
            if word in model_tweets.wv:
                result.append(model_tweets.wv[word])
            else:
                result.append(np.zeros(300))
        else:
            result.append(np.zeros(300))
    return result
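
The function above pads short texts with zero vectors and truncates long ones at `max_len`. The same idea can be written with a preallocated array, which avoids building a Python list of separate arrays. This is a sketch on a made-up 2-dimensional lookup; `toy_vectors` and `pad_embed` are hypothetical names.

```python
import numpy as np

# Hypothetical 2-dimensional lookup standing in for model_tweets.wv
toy_vectors = {"covid": np.array([1.0, 2.0]), "cases": np.array([3.0, 4.0])}

def pad_embed(tokens, max_len, dim=2):
    """Return a (max_len, dim) array: known tokens embedded, the rest zeros."""
    out = np.zeros((max_len, dim), dtype=np.float32)
    for i, word in enumerate(tokens[:max_len]):  # truncate long inputs
        if word in toy_vectors:
            out[i] = toy_vectors[word]
    return out

padded = pad_embed(["covid", "unknown", "cases"], max_len=5)
truncated = pad_embed(["covid"] * 10, max_len=5)
print(padded.shape, truncated.shape)
```

Memory is worth keeping in mind here: a float32 array of shape (6420, 200, 300) takes about 1.5 GB, and twice that in float64.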

In [45]:
features = [get_word_embedding(text, 200) for text in tqdm(token_lists)]

100%|██████████| 6420/6420 [00:01<00:00, 5070.86it/s]


In [46]:
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.33)

In [47]:
import torch
import torch.nn as nn
import torch.optim as optim

In [48]:
len(features[0][0])

300

In [49]:
len(X_train)

4301

In [50]:
len(X_train[0])

200

In [51]:
len(X_train[0][0])

300

In [52]:
class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.lstm = nn.LSTM(300, 100)
        self.out = nn.Linear(100, 1)

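    def forward(self, x):
        # nn.LSTM expects (seq_len, batch, features) by default, hence the transpose
        # It returns the per-step outputs and the final (hidden state, cell state)
        outputs, (hidden, cell) = self.lstm(x.transpose(0, 1))
        # Classify from the final cell state, shape (1, batch, 100)
        prediction = torch.sigmoid(self.out(cell))
        return prediction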


net = Net()
print(net)

Net(
  (lstm): LSTM(300, 100)
  (out): Linear(in_features=100, out_features=1, bias=True)
)
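
By default `nn.LSTM` expects input of shape (seq_len, batch, features), which is why `forward` transposes the first two axes. It returns the outputs for every time step plus a tuple with the final hidden state and the final cell state. The sketch below prints these shapes for a small random batch; the batch size of 4 and sequence length of 7 are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=300, hidden_size=100)  # batch_first=False by default

batch = torch.randn(4, 7, 300)  # (batch, seq_len, features)
output, (h_n, c_n) = lstm(batch.transpose(0, 1))  # input is now (seq_len, batch, features)

print(output.shape)  # one hidden vector per time step
print(h_n.shape)     # final hidden state
print(c_n.shape)     # final cell state
```

For a single-layer, unidirectional LSTM the last time step of `output` equals `h_n[0]`. The `Net` above passes the second element of the state tuple, the cell state, to the linear layer; using the hidden state `h_n` is the more common choice.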


In [53]:
# Stack into a single NumPy array first: building a tensor from a list of arrays is slow
in_data = torch.tensor(np.array(X_train)).float()
targets = torch.tensor(y_train).float()


In [54]:
in_data.shape

torch.Size([4301, 200, 300])

In [55]:
optimizer = optim.SGD(net.parameters(), lr=0.01)
criterion = nn.BCELoss()

In [56]:
def train_one_epoch(in_data, targets, batch_size=16):
    for i in tqdm(range(0, in_data.shape[0], batch_size)):
        batch_x = in_data[i:i + batch_size]
        batch_y = targets[i:i + batch_size]
        optimizer.zero_grad()
        output = net(batch_x)
        loss = criterion(output.reshape(-1), batch_y)
        loss.backward()
        optimizer.step()
    print(loss)

In [57]:
train_one_epoch(in_data, targets)

100%|██████████| 269/269 [01:20<00:00,  3.32it/s]

tensor(0.6854, grad_fn=<BinaryCrossEntropyBackward0>)





How did it do?

In [58]:
in_data_test = torch.tensor(np.array(X_test)).float()
targets_test = torch.tensor(y_test).float()

In [59]:
with torch.no_grad():
    output = net(in_data_test).reshape(-1)

In [60]:
result = (output > 0.5) == targets_test

In [61]:
result.sum().item() / len(result)

0.5276073619631901
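
An accuracy like this is worth comparing with the majority-class baseline: in the earlier split, 1119 of 2119 test examples were real, which is about 0.53. The sketch below shows the comparison on made-up labels and probabilities; `targets_toy` and `output_toy` are hypothetical stand-ins for `targets_test` and `output`.

```python
import torch

# Made-up labels and predicted probabilities, for illustration only
targets_toy = torch.tensor([1.0, 1.0, 0.0, 1.0, 0.0, 1.0])
output_toy = torch.tensor([0.60, 0.70, 0.55, 0.90, 0.52, 0.51])

# Accuracy at a 0.5 threshold
accuracy = ((output_toy > 0.5).float() == targets_toy).float().mean().item()

# Accuracy of always predicting the more frequent class
positive_share = targets_toy.mean().item()
majority_share = max(positive_share, 1 - positive_share)

print(accuracy, majority_share)
```

When the two numbers coincide, as they do here, the model is doing no better than predicting a single class for every example.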

A model like this clearly needs to be trained for longer.
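
A minimal sketch of training for several epochs with reshuffling, run on small random data so that it finishes quickly. It deliberately differs from the notebook in several ways, all of them assumptions made for illustration: a hypothetical `TinyNet` with smaller sizes, the final hidden state instead of the cell state, Adam instead of SGD, and `BCEWithLogitsLoss` on raw logits instead of a sigmoid followed by `BCELoss`.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

class TinyNet(nn.Module):
    """Hypothetical small LSTM classifier, not the Net from the notebook."""

    def __init__(self, input_size=8, hidden_size=16):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, x):
        _, (h_n, _) = self.lstm(x)            # h_n: (1, batch, hidden)
        return self.out(h_n[-1]).squeeze(-1)  # raw logits, shape (batch,)

# Synthetic data: 64 sequences of 5 steps; the label depends on the first feature
x = torch.randn(64, 5, 8)
y = (x[:, :, 0].mean(dim=1) > 0).float()

net = TinyNet()
optimizer = optim.Adam(net.parameters(), lr=0.01)
criterion = nn.BCEWithLogitsLoss()

epoch_losses = []
for epoch in range(5):
    perm = torch.randperm(x.shape[0])  # reshuffle the examples every epoch
    total = 0.0
    for i in range(0, x.shape[0], 16):
        idx = perm[i:i + 16]
        optimizer.zero_grad()
        loss = criterion(net(x[idx]), y[idx])
        loss.backward()
        optimizer.step()
        total += loss.item() * len(idx)
    epoch_losses.append(total / x.shape[0])  # mean loss over the whole epoch
    print(f"epoch {epoch}: mean loss {epoch_losses[-1]:.4f}")
```

Tracking the mean loss per epoch, rather than only the loss of the last batch, gives a steadier signal of whether training is making progress.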