# Text classification

## Feed-forward network for text classification


![title](img/mlp.png)

* $x$ – the input vector representation of the text
* $h$ – hidden layers with nonlinear activation functions
* $y$ – the outputs; as a rule, each output $y$ corresponds to one class label

$NN_{MLP2}(x) = y$

$h^1 = g^1(xW^1 + b^1)$

$h^2 = g^2(h^1 W^2 + b^2)$

$y = h^2 W^3$
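
The forward pass defined by these formulas can be sketched in NumPy. This is only an illustration: the layer sizes, the random weights and the choice of ReLU for both $g^1$ and $g^2$ are assumptions, not something fixed by the formulas.

```python
import numpy as np

def relu(z):
    # Element-wise max(0, z); used here for both g^1 and g^2 (an assumption)
    return np.maximum(0.0, z)

def mlp2_forward(x, W1, b1, W2, b2, W3):
    h1 = relu(x @ W1 + b1)   # h^1 = g^1(x W^1 + b^1)
    h2 = relu(h1 @ W2 + b2)  # h^2 = g^2(h^1 W^2 + b^2)
    return h2 @ W3           # y = h^2 W^3 (no bias, as in the formula)

rng = np.random.default_rng(0)
n_in, n_h1, n_h2, n_out = 6, 4, 3, 2   # illustrative sizes
x = rng.normal(size=(1, n_in))
W1, b1 = rng.normal(size=(n_in, n_h1)), np.zeros(n_h1)
W2, b2 = rng.normal(size=(n_h1, n_h2)), np.zeros(n_h2)
W3 = rng.normal(size=(n_h2, n_out))

y = mlp2_forward(x, W1, b1, W2, b2, W3)
print(y.shape)  # one row of scores, one column per class
```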

### Nonlinear activation functions

![title](img/activation.png)

### Training the network
### The backpropagation algorithm

Loss: cross-entropy: $\text{loss}(y_{true}, \hat{y}_{pred}) = -\sum y_{true} \log(\hat{y}_{pred})$

1. Forward pass:
    * compute $\hat{y}_{pred}$ with the current weights of the hidden layers
    * evaluate $\text{loss}(y_{true}, \hat{y}_{pred})$

2. Backward pass:
    * compute the gradient of the loss with respect to the weights $W_h$ of each hidden layer, using the chain rule:
    
    $\frac{\partial \text{loss}}{\partial W_h} = \frac{\partial \text{loss}}{\partial \hat{y}_{pred}} \frac{\partial \hat{y}_{pred}}{\partial W_h}$
    
    * update the weights with learning rate $\eta$: $\Delta W_h = -\eta \frac{\partial \text{loss}}{\partial W_h}$, $W_h \leftarrow W_h + \Delta W_h$
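
The two passes can be illustrated on the smallest possible case: a single softmax layer trained with plain gradient descent. This is a sketch under stated assumptions (toy random data, no hidden layers, a fixed learning rate `eta`). For softmax with cross-entropy, the gradient with respect to the weights reduces to $X^T(\hat{y}_{pred} - y_{true})$, which is what the backward step computes.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(y_true, y_pred):
    # loss = -sum(y_true * log(y_pred)), averaged over the batch
    return -np.mean(np.sum(y_true * np.log(y_pred + 1e-12), axis=1))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))                 # 8 toy examples, 5 features
y_true = np.eye(2)[rng.integers(0, 2, 8)]   # one-hot labels for 2 classes
W = np.zeros((5, 2))
eta = 0.1                                   # learning rate

losses = []
for step in range(20):
    y_pred = softmax(X @ W)                     # forward pass
    losses.append(cross_entropy(y_true, y_pred))
    grad_W = X.T @ (y_pred - y_true) / len(X)   # backward pass: d loss / d W
    W += -eta * grad_W                          # update: delta W = -eta * gradient

print(losses[0], losses[-1])  # the loss should go down over the steps
```

In a network with hidden layers the same chain rule is applied layer by layer; frameworks such as Keras do this automatically.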

### Dropout regularization

$NN_{MLP2}(x) = y$

$h^1 = g^1(xW^1 + b^1)$

$m^1 \sim \text{Bernoulli}(r^1)$

$\hat{h}^1 = m^1 \odot h^1$

$h^2 = g^2(\hat{h}^1 W^2 + b^2)$

$m^2 \sim \text{Bernoulli}(r^2)$

$\hat{h}^2 = m^2 \odot h^2$

$y = \hat{h}^2 W^3$
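
A minimal NumPy sketch of the masking step, assuming that $r$ is the probability of keeping a unit (the Bernoulli parameter). The helper name `dropout` and its arguments are illustrative.

```python
import numpy as np

def dropout(h, keep_prob, rng):
    # m ~ Bernoulli(keep_prob): 1 keeps a unit, 0 drops it
    m = rng.binomial(1, keep_prob, size=h.shape)
    return m * h  # element-wise product of the mask and the activations

rng = np.random.default_rng(0)
h1 = np.ones((1, 10))
h1_hat = dropout(h1, keep_prob=0.8, rng=rng)
print(h1_hat)  # a random subset of the units is zeroed
```

Note that the Keras `Dropout(rate)` layer is parameterized the other way round: `rate` is the fraction of units to drop, and the remaining activations are rescaled during training.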



### Vector representation of text


1. Bag of words [Bag of Words, BoW]
    * $V$ – the vocabulary, $|V| = N$
    * $x \in D$ – a document, $|x| = k$
    * $\vec{x}$ – an $N$-dimensional vector, $\vec{x}_i = f(\text{word}_i, x)$ (for example, the count of $\text{word}_i$ in $x$), with at most $k$ nonzero components

2. Distributed word representations [Continuous Bag of Words, CBoW]
    * one-hot encoding: each word $\text{word}$ is an $N$-dimensional vector with $\overrightarrow{\text{word}}_i = 1$ at the index $i$ of the word in the vocabulary and 0 elsewhere
    * dense vectors (embeddings): each word $\text{word}$ is a $d$-dimensional vector, $\overrightarrow{\text{word}}_i \in \mathbb{R}$
	
    Embedding matrix: $E \in \mathbb{R}^{|V| \times d}$
	
    * $\text{CBOW}(x) = \frac{1}{k} \sum_{i=1}^{k} E_{\text{word}_i}$
    * $\text{x} = [\overrightarrow{\text{word}}_1  ,\ldots, \overrightarrow{\text{word}}_k ]$
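
Both representations can be computed by hand on a toy example. The vocabulary, the document and the randomly initialized matrix `E` below are made up for illustration; $f$ is taken to be the word count.

```python
import numpy as np

vocab = {"good": 0, "bad": 1, "movie": 2, "very": 3}  # toy vocabulary, N = 4
doc = ["very", "good", "movie", "very"]               # toy document, k = 4

# Bag of words: N-dimensional count vector
bow = np.zeros(len(vocab))
for word in doc:
    bow[vocab[word]] += 1

# CBoW: average of the embedding rows of the document's words
d = 3
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), d))   # randomly initialized embedding matrix
ids = [vocab[word] for word in doc]
cbow = E[ids].mean(axis=0)

# The same average expressed through the count vector: (bow / k) @ E
cbow_from_bow = bow @ E / len(doc)

print(bow)         # [1. 0. 1. 2.]
print(cbow.shape)  # (3,)
```

The last lines show how the two views are connected: averaging embeddings is the same as multiplying the normalized count vector by the embedding matrix.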


#### Padding
Input texts vary in length, which is inconvenient, so we assume that they all consist of the same number of words, with some of those words being filler pad tokens.
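
A pure-Python sketch of the idea, assuming that the pad token has id 0 and that sequences are padded and truncated on the left (the helper `pad` and the constant `PAD_ID` are illustrative names):

```python
PAD_ID = 0  # assumed id of the pad token

def pad(seq, length, pad_id=PAD_ID):
    # Keep the last `length` tokens (length > 0), then left-pad with pad_id
    seq = seq[-length:]
    return [pad_id] * (length - len(seq)) + seq

print(pad([5, 8, 2], 5))           # [0, 0, 5, 8, 2]
print(pad([1, 2, 3, 4, 5, 6], 5))  # [2, 3, 4, 5, 6]
```

Keras `pad_sequences` with its default settings also pads and truncates at the start of a sequence.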


#### Unknown words (OOV)
If an unknown word occurs in the test set, we can
* replace it with pad;
* replace it with unk. However, unk never occurs in the training set, so it has to be added to the training set artificially.
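
One common way to introduce unk into the training set is to leave rare training words out of the vocabulary, so that they are encoded as unk. The sketch below assumes this approach; the token strings and the `min_count` threshold are hypothetical.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens

def build_vocab(train_docs, min_count=2):
    # Words rarer than min_count are left out, so they map to UNK in training too
    counts = Counter(w for doc in train_docs for w in doc)
    vocab = {PAD: 0, UNK: 1}
    for word, c in counts.items():
        if c >= min_count:
            vocab[word] = len(vocab)
    return vocab

def encode(doc, vocab):
    # Any word missing from the vocabulary gets the id of UNK
    return [vocab.get(w, vocab[UNK]) for w in doc]

train_docs = [["good", "movie"], ["bad", "movie"], ["good", "plot"]]
vocab = build_vocab(train_docs)
print(encode(["good", "movie", "plot"], vocab))  # [2, 3, 1]: "plot" is rare
print(encode(["awful", "movie"], vocab))         # [1, 3]: "awful" is unseen
```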


#### Word dropout regularization
Replace each word with unk with probability $\frac{\alpha}{|V| + \alpha}$
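
A short sketch of this replacement rule. The function name, the `<unk>` string and the toy values of `vocab_size` and `alpha` are assumptions chosen for illustration.

```python
import random

UNK = "<unk>"  # hypothetical unknown-word token

def word_dropout(doc, vocab_size, alpha, rng):
    # Replacement probability from the formula: alpha / (|V| + alpha)
    p = alpha / (vocab_size + alpha)
    return [UNK if rng.random() < p else w for w in doc]

rng = random.Random(1228)
doc = ["very", "good", "movie"]
print(word_dropout(doc, vocab_size=4, alpha=1.0, rng=rng))
```

Word dropout is applied only to training texts, and a fresh random mask is drawn every time a text is used.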


In [1]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import np_utils
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

from keras.layers import Embedding, Input, Conv1D, MaxPooling1D, Flatten, Dense, Dropout
from keras.models import Model, Sequential

import pandas as pd
import numpy as np

import seaborn as sns

import matplotlib.pyplot as plt

import random
random.seed(1228)

from sklearn.metrics import precision_score, recall_score, accuracy_score, classification_report, confusion_matrix

%matplotlib inline

from warnings import filterwarnings

filterwarnings('ignore')

In [2]:
from pymystem3 import Mystem
import re
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

m = Mystem()


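# Cyrillic and Latin letters plus the punctuation used in emoticons.
# A-Za-z replaces A-z, which also matched the characters between "Z" and "a"; Ёё is added to the Cyrillic range.
regex = re.compile(r"[А-Яа-яЁёA-Za-z:=!()_%/|]+")

def words_only(text, regex=regex):
    # Keep only the tokens matched by the regex; non-string input (e.g. NaN) gives ""
    try:
        return " ".join(regex.findall(text))
    except TypeError:
        return ""



def lemmatize(text, mystem=m):
    # Mystem returns a list of lemmas and separators; join them back into a string
    try:
        return "".join(mystem.lemmatize(text)).strip()
    except Exception:
        return " "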


df_neg = pd.read_csv("datasets/nlp/negative.csv", sep=';', header = None, usecols = [3])
df_pos = pd.read_csv("datasets/nlp/positive.csv", sep=';', header = None, usecols = [3])
df_neg['sent'] = 'neg'
df_pos['sent'] = 'pos'
df = pd.concat([df_neg, df_pos])
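# concat puts all negative rows before the positive ones, so shuffle first;
# otherwise a positional slice may contain a single class
df = df.sample(frac=1, random_state=1228)[15000:20000]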
df.columns = ['text', 'sent']
df.text = df.text.apply(words_only)
df.text = df.text.apply(lemmatize)


X = df.text.tolist()
y = df.sent.tolist()

X, y = np.array(X), np.array(y)

X_text_train, X_text_test, y_train, y_test = train_test_split(X,y, test_size=0.33)
print ("total train examples %s" % len(y_train))
print ("total test examples %s" % len(y_test))

total train examples 3350
total test examples 1650


In [3]:
TEXT_LENGTH = 10
VOCABULARY_SIZE = 20000
EMBEDDING_DIM = 100
DIMS = 250
MAX_FEATURES = 5000

batch_size = 32
nb_filter = 250
filter_length = 3
hidden_dims = 250
nb_epoch = 5

## Feed-forward network
## BoW

In [4]:
tokenizer = Tokenizer(num_words=MAX_FEATURES)
tokenizer.fit_on_texts(X_text_train)
tokenizer

<keras_preprocessing.text.Tokenizer at 0x1c910e5ef40>

In [5]:
sequences = tokenizer.texts_to_sequences(X_text_train)
X_train = tokenizer.sequences_to_matrix(sequences, mode='count')
sequences = tokenizer.texts_to_sequences(X_text_test)
X_test = tokenizer.sequences_to_matrix(sequences, mode='count')

In [6]:
print('First seq:', sequences[0])
print('First doc:', X_train[0])

First seq: [1098, 2, 3303, 32, 18, 50, 9, 792, 12, 582]
First doc: [0. 0. 0. ... 0. 0. 0.]


In [7]:
le = LabelEncoder()
le.fit(['pos', 'neg'])
y_train_cat = np_utils.to_categorical(le.transform(y_train), 2)
y_test_cat = np_utils.to_categorical(le.transform(y_test), 2)


print(y_train_cat[0])

[1. 0.]


In [8]:
model = Sequential()
model.add(Dense(128, input_shape=(MAX_FEATURES, ), activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train_cat, epochs=nb_epoch, batch_size=batch_size, validation_split=.1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1c94cb43e80>

In [9]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 128)               640128    
_________________________________________________________________
dropout (Dropout)            (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 258       
Total params: 640,386
Trainable params: 640,386
Non-trainable params: 0
_________________________________________________________________


In [10]:
# predict_classes was removed from Sequential in recent Keras versions;
# take the argmax over the predicted class probabilities instead
pred = np.argmax(model.predict(X_test), axis=-1)
pred = le.inverse_transform(pred)

In [None]:
from sklearn.metrics import f1_score  # the other metrics are imported in the first cell



print("Precision: {0:6.2f}".format(precision_score(y_test, pred, average='macro')))
print("Recall: {0:6.2f}".format(recall_score(y_test, pred, average='macro')))
print("F1-measure: {0:6.2f}".format(f1_score(y_test, pred, average='macro')))
print("Accuracy: {0:6.2f}".format(accuracy_score(y_test, pred)))
print(classification_report(y_test, pred))



sns.heatmap(data=confusion_matrix(y_test, pred), annot=True, fmt="d", cbar=False, xticklabels=le.classes_, yticklabels=le.classes_)
plt.title("Confusion matrix")
plt.show()

### CBoW: randomly initialized embeddings

In [None]:
# Turn each text into a fixed-length sequence of word ids
sequences = tokenizer.texts_to_sequences(X_text_train)
X_train = pad_sequences(sequences, maxlen=TEXT_LENGTH)
sequences = tokenizer.texts_to_sequences(X_text_test)
X_test = pad_sequences(sequences, maxlen=TEXT_LENGTH)

In [None]:
X_train[0]

In [None]:
model = Sequential()
model.add(Embedding(VOCABULARY_SIZE, EMBEDDING_DIM, input_length=TEXT_LENGTH, trainable=True))
model.add(Flatten())
model.add(Dense(128))
model.add(Dropout(.1))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train_cat, epochs=nb_epoch, batch_size=batch_size, validation_split=.1)

In [None]:
# predict_classes is not available in recent Keras versions; use argmax over predict
pred = np.argmax(model.predict(X_test), axis=-1)
pred = le.inverse_transform(pred)

In [None]:
from sklearn.metrics import f1_score  # the other metrics are imported in the first cell



print("Precision: {0:6.2f}".format(precision_score(y_test, pred, average='macro')))
print("Recall: {0:6.2f}".format(recall_score(y_test, pred, average='macro')))
print("F1-measure: {0:6.2f}".format(f1_score(y_test, pred, average='macro')))
print("Accuracy: {0:6.2f}".format(accuracy_score(y_test, pred)))
print(classification_report(y_test, pred))



sns.heatmap(data=confusion_matrix(y_test, pred), annot=True, fmt="d", cbar=False, xticklabels=le.classes_, yticklabels=le.classes_)
plt.title("Confusion matrix")
plt.show()

In [None]:
%%time

import numpy as np

emb_path = 'datasets/nlp/wiki.ru.vec'

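words = []

# Read pretrained word vectors from a text .vec file:
# each data line holds a word followed by its 300 coefficients
embeddings_index = {}
with open(emb_path, encoding='utf-8') as f:
    for line in f:
        values = line.split()
        # skip any line that is not a word plus 300 values (e.g. a header line)
        if len(values) == 301:
            word = values[0]
            words.append(word)
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs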

In [None]:
print(len(embeddings_index))

In [None]:
word_index = tokenizer.word_index
len(word_index)

In [None]:
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

In [None]:
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                            300,
                            weights=[embedding_matrix],
                            input_length=TEXT_LENGTH,
                            trainable=False))
model.add(Flatten())
model.add(Dense(128))
model.add(Dense(2, activation = 'softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train_cat, epochs=nb_epoch, batch_size=batch_size,  validation_split=0.1)


In [None]:
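from sklearn.metrics import f1_score  # the other metrics are imported in the first cell

# Recompute predictions with the pretrained-embedding model;
# otherwise pred still holds the output of the previous model
pred = le.inverse_transform(np.argmax(model.predict(X_test), axis=-1))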



print("Precision: {0:6.2f}".format(precision_score(y_test, pred, average='macro')))
print("Recall: {0:6.2f}".format(recall_score(y_test, pred, average='macro')))
print("F1-measure: {0:6.2f}".format(f1_score(y_test, pred, average='macro')))
print("Accuracy: {0:6.2f}".format(accuracy_score(y_test, pred)))
print(classification_report(y_test, pred))



sns.heatmap(data=confusion_matrix(y_test, pred), annot=True, fmt="d", cbar=False, xticklabels=le.classes_, yticklabels=le.classes_)
plt.title("Confusion matrix")
plt.show()