# Text classification

Take the notebook colab_text_classification_part1.ipynb that we went through in class and add the items we skipped:
1. Check whether quality on the standard approaches improves with lemmatization compared to without it
2. Remove entities (names, locations, anything else) from the texts, or replace them with a tag. Run the classifier and the neural-network models
3. Draw conclusions

In [1]:
import pandas as pd

In [2]:
train_df = pd.read_csv("train.tsv", delimiter="\t")
test_df = pd.read_csv("test.tsv", delimiter="\t")

In [3]:
print('Train size = {}'.format(len(train_df)))
print('Test size = {}'.format(len(test_df)))

Train size = 25000
Test size = 25000


In [4]:
train_df.head(5)

Unnamed: 0,is_positive,review
0,0,"Dreamgirls, despite its fistful of Tony wins i..."
1,0,This show comes up with interesting locations ...
2,1,I simply love this movie. I also love the Ramo...
3,0,Spoilers ahead if you want to call them that.....
4,1,My all-time favorite movie! I have seen many m...


In [5]:
#@title Let's start classifying! { vertical-output: true, display-mode: "form" }
positive_words = 'love', 'great', 'best', 'wonderful' #@param {type:"raw"}
negative_words = 'worst', 'awful', '1/10', 'crap' #@param {type:"raw"}

positives_count = test_df.review.apply(lambda text: sum(word in text for word in positive_words))
negatives_count = test_df.review.apply(lambda text: sum(word in text for word in negative_words))
is_positive = positives_count > negatives_count
correct_count = (is_positive == test_df.is_positive).values.sum()

accuracy = correct_count / len(test_df)

print('Test accuracy = {:.2%}'.format(accuracy))
if accuracy > 0.71:
    from IPython.display import Image, display
    display(Image('https://s3.amazonaws.com/achgen360/t/rmmoZsub.png', width=500))

Test accuracy = 66.73%


**Task** Come up with good keywords or phrases and reach at least 71% accuracy on the test set (and don't forget to look at the classification code!)

**Done** I took two word lists (positive and negative) and cut each down to about a third, reviewing the words one by one.  
Using the full lists did not improve the result and even made it slightly worse. It took three tries to reach the target of 71%.

Original word lists:
https://ptrckprry.com/course/ssd/data/positive-words.txt
https://ptrckprry.com/course/ssd/data/negative-words.txt

In [6]:
with open('positive_words.txt', 'r') as p:
    pos = p.read()
    positive_w = pos.split('\n')
    
with open('negative_words.txt', 'r') as n:
    neg = n.read()
    negative_w = neg.split('\n')


In [7]:
positive_w[:5]

['admired', 'admiring', 'admiringly', 'adorable', 'adored']

In [8]:
positives_count = test_df.review.apply(lambda text: sum(word in text for word in positive_w))
negatives_count = test_df.review.apply(lambda text: sum(word in text for word in negative_w))
is_positive = positives_count > negatives_count
correct_count = (is_positive == test_df.is_positive).values.sum()

accuracy = correct_count / len(test_df)

print('Test accuracy = {:.2%}'.format(accuracy))


Test accuracy = 71.68%
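
A note on the classification code: `word in text` is a substring test, so `'love'` also fires inside `'glove'`, and matching is case-sensitive. The sketch below shows one possible whole-word, case-insensitive variant. The helper name `count_keywords` and the sample sentence are made up for illustration, and this is not necessarily what the task intended.

```python
import re

def count_keywords(text, keywords):
    """Count how many keywords occur in text as whole words (case-insensitive)."""
    lowered = text.lower()
    total = 0
    for word in keywords:
        # (?<!\w) and (?!\w) act like \b but also work for keywords such as '1/10'
        pattern = r'(?<!\w)' + re.escape(word.lower()) + r'(?!\w)'
        if re.search(pattern, lowered):
            total += 1
    return total

sample = 'The glove scene was the worst, 1/10.'
keywords = ('love', 'worst', '1/10')

# Substring test, as in the cells above: 'love' matches inside 'glove'
substring_hits = sum(word in sample for word in keywords)
# Whole-word test: 'love' no longer matches
whole_word_hits = count_keywords(sample, keywords)
print(substring_hits, whole_word_hits)
```

Whether this helps accuracy on the reviews is an open question: substring matching also catches useful inflected forms (`'loved'`, `'loves'`), so it is worth measuring both variants.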


**Task** Does anyone like these `<br /><br />`? Personally, I don't. Write a regex that removes them.

In [9]:
import re
pattern = re.compile('<br />')

In [10]:
train_df['review'] = train_df['review'].apply(lambda text: pattern.sub(' ', text))
test_df['review'] = test_df['review'].apply(lambda text: pattern.sub(' ', text))
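
The literal pattern above only matches the exact string `<br />`. If the data also contained variants such as `<br>`, `<br/>` or `<BR>` (an assumption, not something checked on this dataset), a slightly more tolerant regex could look like the sketch below. The helper `strip_br` is a made-up name.

```python
import re

# Matches <br>, <br/>, <br /> and upper-case variants, plus surrounding whitespace
br_pattern = re.compile(r'\s*<br\s*/?>\s*', flags=re.IGNORECASE)

def strip_br(text):
    """Replace every run of <br> tags with a single space."""
    # The second substitution collapses any whitespace run left behind
    return re.sub(r'\s+', ' ', br_pattern.sub(' ', text)).strip()

cleaned = strip_br('Great film.<br /><br />Loved it!<BR>Twice.')
print(cleaned)
```

Note that this version also collapses other whitespace runs in the text, which is usually harmless for bag-of-words models.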

**Vectorizing**

```python
CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=r'(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
```
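
To get a feel for what the defaults `lowercase=True` and `token_pattern=r'(?u)\b\w\w+\b'` do, here is a small pure-Python imitation. It is only a sketch of the counting idea, not scikit-learn's implementation, and `bag_of_words` is a made-up helper name.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r'(?u)\b\w\w+\b')  # same pattern as the default above

def bag_of_words(text):
    """Lowercase the text, extract tokens of 2+ word characters, count them."""
    return Counter(TOKEN_PATTERN.findall(text.lower()))

counts = bag_of_words('I love this movie. I LOVE it, a 10/10!')
print(counts)
```

Two things are visible on this toy sentence: single-character tokens such as `I` and `a` are dropped, and `10/10` is split into two `10` tokens, so a keyword like `'1/10'` from the first classifier cannot survive as a single feature with the default tokenizer.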

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()


In [12]:
vectorizer = CountVectorizer()
vectorizer.fit(train_df['review'].values)

In [13]:
vectorizer.get_feature_names_out()

array(['00', '000', '0000000000001', ..., 'østbye', 'über', 'üvegtigris'],
      dtype=object)

In [14]:
vectorizer.transform([train_df['review'].iloc[3]])

<1x74849 sparse matrix of type '<class 'numpy.int64'>'
	with 206 stored elements in Compressed Sparse Row format>

In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


In [16]:
vectorizer = CountVectorizer()
classifier = LogisticRegression()

model = Pipeline([
    ('vectorizer', vectorizer),
    ('classifier', classifier)
])

In [17]:
model.fit(train_df['review'], train_df['is_positive'])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
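
The warning above says the solver stopped at its iteration limit (the default is `max_iter=100`) before converging. One common remedy is simply to raise the limit. The sketch below does that on toy data invented for illustration; the value 1000 is an arbitrary choice, and on the real reviews the accuracy may or may not change.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only
toy_texts = ['great movie loved it', 'wonderful great acting',
             'awful boring movie', 'worst film awful plot']
toy_labels = [1, 1, 0, 0]

toy_model = Pipeline([
    ('vectorizer', CountVectorizer()),
    # max_iter=1000 is an arbitrary, more generous limit than the default of 100
    ('classifier', LogisticRegression(max_iter=1000)),
])
toy_model.fit(toy_texts, toy_labels)

toy_preds = toy_model.predict(['great wonderful', 'awful worst'])
# n_iter_ reports how many iterations the solver actually used
n_iterations = toy_model.named_steps['classifier'].n_iter_[0]
print(toy_preds, n_iterations)
```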


In [18]:
from sklearn.metrics import accuracy_score

def eval_model(model, test_df):
    preds = model.predict(test_df['review'])
    print('Test accuracy = {:.2%}'.format(accuracy_score(test_df['is_positive'], preds)))
    
eval_model(model, test_df)

Test accuracy = 86.37%


**Visualizing weights with eli5**

In [90]:
import eli5
eli5.show_weights(classifier, vec=vectorizer, top=20)

Weight?,Feature
… 10319 more positive …,… 10319 more positive …
… 9662 more negative …,… 9662 more negative …
-2.445,x5014
-2.449,x2920
-2.454,x19343
-2.488,x420
-2.570,x6897
-2.628,x19342
-2.630,x2919
-2.657,x2160


**NB: since scikit-learn==0.22 the words in a given text are not highlighted anymore and tokens are given as features**
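
Names like `x5014` are hard to read. Assuming they are column indices into the feature matrix of whichever vectorizer was fitted when the cell ran (an assumption about how these names are generated), they can be mapped back to tokens with a small helper. `feature_tokens` is a made-up name, and the toy list stands in for `vectorizer.get_feature_names_out()`.

```python
def feature_tokens(feature_ids, feature_names):
    """Map names such as 'x5014' to the entries of feature_names at that index."""
    tokens = {}
    for feature_id in feature_ids:
        index = int(feature_id[1:])  # drop the leading 'x'
        tokens[feature_id] = feature_names[index]
    return tokens

# Toy vocabulary standing in for vectorizer.get_feature_names_out()
toy_names = ['awful', 'great', 'movie', 'worst']
mapping = feature_tokens(['x3', 'x0'], toy_names)
print(mapping)
```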

In [91]:
print('Positive' if test_df['is_positive'].iloc[1] else 'Negative')
eli5.show_prediction(classifier, test_df['review'].iloc[1], vec=vectorizer, top=20,
                     targets=['positive'], target_names=['negative', 'positive'])

Positive


Contribution?,Feature
+0.113,x9227
+0.108,x1329
+0.084,x4246
+0.075,x17881
+0.073,x6343
+0.072,x9226
+0.068,x16059
+0.062,x1252
+0.056,x12799
+0.056,x2618


In [93]:
print('Positive' if test_df['is_positive'].iloc[6] else 'Negative')
eli5.show_prediction(classifier, test_df['review'].iloc[6], vec=vectorizer, top=20,
                     targets=['positive'], target_names=['negative', 'positive'])

Negative


Contribution?,Feature
+0.075,x9227
+0.066,x1329
+0.066,x6343
+0.063,x9226
+0.062,x4246
+0.058,x10859
+0.057,x17881
… 784 more positive …,… 784 more positive …
… 920 more negative …,… 920 more negative …
-0.053,x7025


In [94]:
import numpy as np

preds = model.predict(test_df['review'])
incorrect_pred_index = np.random.choice(np.where(preds != test_df['is_positive'])[0])

eli5.show_prediction(classifier, test_df['review'].iloc[incorrect_pred_index],
                     vec=vectorizer, top=20, targets=['positive'], target_names=['negative', 'positive'])

Contribution?,Feature
+0.091,x9227
+0.080,x1329
+0.079,x12799
+0.074,x16059
+0.072,x6343
+0.072,x10859
+0.071,x9226
… 2590 more positive …,… 2590 more positive …
… 2529 more negative …,… 2529 more negative …
-0.069,x3920


**Task** Try running `TfidfVectorizer`. Look at the errors it has learned to fix and the errors it has started to make, compared with `CountVectorizer`.
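
As a reminder of what changes relative to raw counts: with its default settings `TfidfVectorizer` multiplies each count by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and then L2-normalizes each row. The pure-Python sketch below reproduces that weighting on pre-tokenized toy documents. It is an illustration of the formula, not the library code, and `tfidf` is a made-up helper name.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed idf and L2 row normalization, on pre-tokenized docs."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears
    doc_freq = Counter(token for doc in docs for token in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for doc in docs:
        weights = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# Toy documents, invented for illustration
toy_docs = [['good', 'movie'], ['bad', 'movie'], ['boring', 'movie']]
rows = tfidf(toy_docs)
print(rows[0])
```

In the toy corpus `movie` appears in every document, so it ends up with a lower weight than `good`, which appears in only one.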

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
classifier = LogisticRegression()

model = Pipeline([
    ('vectorizer', vectorizer),
    ('classifier', classifier)
])

model.fit(train_df['review'], train_df['is_positive'])

eval_model(model, test_df)

Test accuracy = 88.28%
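
To actually look at the errors fixed and the errors introduced, one option is to keep the predictions of both models and compare them index by index. A minimal sketch, with toy predictions invented for illustration (on the real data the two lists would come from `predict` calls of the `CountVectorizer` and `TfidfVectorizer` pipelines):

```python
def compare_errors(y_true, preds_old, preds_new):
    """Return indices the new model fixed and indices it newly got wrong."""
    fixed, broken = [], []
    for i, (truth, old, new) in enumerate(zip(y_true, preds_old, preds_new)):
        if old != truth and new == truth:
            fixed.append(i)
        elif old == truth and new != truth:
            broken.append(i)
    return fixed, broken

# Toy labels and predictions, invented for illustration
y_true = [1, 0, 1, 1, 0]
preds_cnt = [1, 1, 0, 1, 0]   # stand-in for the CountVectorizer model
preds_tfi = [1, 0, 0, 0, 0]   # stand-in for the TfidfVectorizer model
fixed, broken = compare_errors(y_true, preds_cnt, preds_tfi)
print(fixed, broken)
```

The returned indices can then be passed to `test_df['review'].iloc[...]` to read the reviews themselves.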


**Task** Try a wider n-gram range and interpret the result.

In [25]:
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
classifier = LogisticRegression()

model = Pipeline([
    ('vectorizer', vectorizer),
    ('classifier', classifier)
])

model.fit(train_df['review'], train_df['is_positive'])

eval_model(model, test_df)

Test accuracy = 88.64%
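
For interpretation it helps to see which features `ngram_range=(1, 2)` adds. The sketch below enumerates word n-grams for a toy sentence. It imitates the idea only, and `word_ngrams` is a made-up helper name.

```python
def word_ngrams(tokens, min_n, max_n):
    """All contiguous word n-grams with min_n <= n <= max_n, shortest first."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

grams = word_ngrams('this was not good'.split(), 1, 2)
print(grams)
```

With bigrams the model gets a separate feature for `not good`, which a unigram model can only see as `not` plus `good`. That is one plausible reason for the small gain, although the vocabulary also grows considerably.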


**Task** Build a classifier on character n-grams and visualize it.

In [26]:
vectorizer = TfidfVectorizer(ngram_range=(2, 6), max_features=20000, analyzer='char')
classifier = LogisticRegression()

model = Pipeline([
    ('vectorizer', vectorizer),
    ('classifier', classifier)
])

model.fit(train_df['review'], train_df['is_positive'])

eval_model(model, test_df)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Test accuracy = 87.86%
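
One intuition for why character n-grams work reasonably well: different forms of the same word share many of their n-grams, so the model gets something like stemming for free. A small sketch of this (the helper `char_ngrams` is a made-up name; the real vectorizer runs over the whole review, spaces included):

```python
def char_ngrams(text, min_n, max_n):
    """Set of all character n-grams of the text with min_n <= n <= max_n."""
    return {text[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(text) - n + 1)}

# n-grams that 'loved' and 'loving' have in common
shared = char_ngrams('loved', 2, 6) & char_ngrams('loving', 2, 6)
print(sorted(shared))
```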


In [27]:
print('Positive' if test_df['is_positive'].iloc[1] else 'Negative')
eli5.show_prediction(classifier, test_df['review'].iloc[1], vec=vectorizer, 
                     targets=['positive'], target_names=['negative', 'positive'])

Positive


Contribution?,Feature
0.113,x9227
0.108,x1329
0.084,x4246
0.075,x17881
0.073,x6343
0.072,x9226
0.068,x16059
0.062,x1252
0.056,x12799
0.056,x2618


**Task** Find forms of the same word that carry different sentiment.

Taking it on faith that they exist, let's try to do something about it.

For example, we can lemmatize: reduce every word to its base form. The spaCy library will help with this.
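
Before lemmatizing, one rough way to look for such word forms is to group the vocabulary by an approximate stem and keep the groups whose learned weights have both signs. The sketch below uses a deliberately crude suffix stripper and hypothetical weights, both invented for illustration. On the real model the weights could be taken from `classifier.coef_[0]` paired with `vectorizer.get_feature_names_out()`.

```python
from collections import defaultdict

def crude_stem(word):
    """Very rough stand-in for a stemmer: strip a few common English suffixes."""
    for suffix in ('ing', 'ed', 'es', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def mixed_sign_groups(weights):
    """Group tokens by crude stem; keep groups with both positive and negative weights."""
    groups = defaultdict(dict)
    for token, weight in weights.items():
        groups[crude_stem(token)][token] = weight
    return {stem: forms for stem, forms in groups.items()
            if min(forms.values()) < 0 < max(forms.values())}

# Hypothetical weights, invented for illustration (not taken from the model above)
toy_weights = {'disappoint': 0.3, 'disappointed': -0.8, 'disappointing': -1.2,
               'loved': 1.1, 'loves': 0.7}
mixed = mixed_sign_groups(toy_weights)
print(mixed)
```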

In [59]:
import spacy
from spacy import displacy

# Small English pipeline; provides the lemmatizer and the NER used below
nlp = spacy.load("en_core_web_sm")

In [60]:
lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode)  # 'rule'


rule


In [30]:
docs = [nlp(doc) for doc in train_df.review.values]

In [85]:
# One list of lemma strings per review
lemmas = [[token.lemma_ for token in doc] for doc in docs]


In [86]:
len(lemmas)

25000

In [87]:
train_df['lemmas'] = lemmas

In [88]:
train_df

Unnamed: 0,is_positive,review,lemmas
0,0,"Dreamgirls, despite its fistful of Tony wins i...","[dreamgirl, ,, despite, its, fistful, of, Tony..."
1,0,This show comes up with interesting locations ...,"[this, show, come, up, with, interesting, loca..."
2,1,I simply love this movie. I also love the Ramo...,"[I, simply, love, this, movie, ., I, also, lov..."
3,0,Spoilers ahead if you want to call them that.....,"[spoiler, ahead, if, you, want, to, call, they..."
4,1,My all-time favorite movie! I have seen many m...,"[my, all, -, time, favorite, movie, !, I, have..."
...,...,...,...
24995,1,"I am a big fan of the movie, but not for the u...","[I, be, a, big, fan, of, the, movie, ,, but, n..."
24996,0,I'm not going to bother with a plot synopsis s...,"[I, be, not, go, to, bother, with, a, plot, sy..."
24997,0,This movie . . . I don't know. Why they would ...,"[this, movie, ., ., ., I, do, not, know, ., wh..."
24998,1,Saw this film on DVD yesterday and was gob-sma...,"[see, this, film, on, dvd, yesterday, and, be,..."


In [None]:
docs_test = [nlp(doc) for doc in test_df.review.values]

In [None]:
# One list of lemma strings per test review
lemmas_test = [[token.lemma_ for token in doc] for doc in docs_test]
test_df['lemmas'] = lemmas_test


In [None]:
test_df.head(2)

In [None]:
# Join the lemma lists back into strings: TfidfVectorizer expects raw text
train_lemma_texts = [' '.join(tokens) for tokens in lemmas]
test_lemma_texts = [' '.join(tokens) for tokens in lemmas_test]

vectorizer_lem = TfidfVectorizer(ngram_range=(2, 6), max_features=20000, analyzer='char')
classifier_lem = LogisticRegression()

model_lem = Pipeline([
    ('vectorizer', vectorizer_lem),
    ('classifier', classifier_lem)
])

model_lem.fit(train_lemma_texts, train_df['is_positive'])

# Evaluate on the lemmatized test texts, not on the raw reviews
preds_lem = model_lem.predict(test_lemma_texts)
print('Test accuracy = {:.2%}'.format(accuracy_score(test_df['is_positive'], preds_lem)))


**Task** Build a classifier on the lemmatized texts.

In [33]:
from nltk import PorterStemmer

stemmer = PorterStemmer()

**Task** Try classifying stems instead of lemmas.
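
A minimal sketch for the stemming task, assuming simple whitespace tokenization (the helper `stem_text` is a made-up name). Each review becomes a string of stems that can be fed to the same kind of pipeline as the raw reviews.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_text(text):
    """Stem every whitespace-separated token and join the stems back into a string."""
    return ' '.join(stemmer.stem(token) for token in text.lower().split())

stemmed_a = stem_text('loved')
stemmed_b = stem_text('loving')
print(stemmed_a, stemmed_b)
```

On the dataframes this could be applied as `train_df['review'].apply(stem_text)`. Punctuation stays glued to tokens with whitespace splitting, so a proper tokenizer may give cleaner stems.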

In [None]:
displacy.render(docs[0], style='ent', jupyter=True)


**Task** Remove some of the entities from the texts, using the coordinates from the pickled files. The entity descriptions can be found [here](https://spacy.io/api/annotation#named-entities). Run the classifier.
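
A sketch of the replacement step, assuming the coordinates are character offsets stored as `(start, end, label)` tuples. The actual format of the pickled files is not shown here, so this layout is an assumption, and `replace_entities` is a made-up helper name. With spaCy the same tuples can be built from `doc.ents` via `ent.start_char`, `ent.end_char` and `ent.label_`.

```python
def replace_entities(text, spans, drop_labels=('PERSON', 'GPE')):
    """Replace each (start, end, label) span whose label is in drop_labels with <LABEL>."""
    result = text
    # Work right to left so earlier offsets stay valid after each replacement
    for start, end, label in sorted(spans, reverse=True):
        if label in drop_labels:
            result = result[:start] + '<' + label + '>' + result[end:]
    return result

review = 'Eddie Murphy shines in Detroit.'
# Hand-written character offsets standing in for the pickled coordinates
spans = [(0, 12, 'PERSON'), (23, 30, 'GPE')]
masked = replace_entities(review, spans)
print(masked)
```

To delete entities instead of tagging them, the `'<' + label + '>'` part can be replaced with an empty string.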

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

_, _, hist = plt.hist(train_df.review.apply(lambda text: len(text.split())), bins='auto')
hist

In [None]:
from collections import Counter

words_counter = Counter((word for text in train_df.review for word in text.lower().split()))

word2idx = {
    '': 0,
    '<unk>': 1
}
for word, count in words_counter.most_common():
    if count < 10:
        break
        
    word2idx[word] = len(word2idx)
    
print('Words count', len(word2idx))

**Task** Convert the data

In [None]:
def convert(texts, word2idx, max_text_len):
    # Rows are left-padded with 0 (the padding index) up to max_text_len
    data = np.zeros((len(texts), max_text_len), dtype=np.int64)

    for inx, text in enumerate(texts):
        # Lowercase to match how word2idx was built; unknown words map to '<unk>'
        result = [word2idx.get(word, word2idx['<unk>']) for word in text.lower().split()]
        result = result[-max_text_len:]  # keep only the last max_text_len tokens
        if result:
            data[inx, -len(result):] = result
    return data

X_train = convert(train_df.review, word2idx, 1000)
X_test = convert(test_df.review, word2idx, 1000)

In [None]:
import tensorflow as tf
tf.__version__

In [None]:
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalMaxPooling1D, Dropout

In [None]:
model = Sequential([
    Embedding(input_dim=len(word2idx), output_dim=64, input_shape=(X_train.shape[1],)),
    GlobalMaxPooling1D(),
    Dense(units=10, activation='relu'),
    Dense(units=10, activation='relu'),
    
    Dense(units=1, activation='sigmoid')
])

model.summary()
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [None]:
model.fit(X_train, train_df.is_positive, batch_size=128, epochs=10, 
          validation_data=(X_test, test_df.is_positive))

**Task** Compute the model's quality on the test set
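
In Keras the shortest route is `model.evaluate(X_test, test_df.is_positive)`, which returns the loss and the metrics passed to `compile`. To get a number comparable with `eval_model` above, the sigmoid outputs can also be thresholded by hand. A sketch with toy probabilities invented for illustration (`accuracy_from_probs` is a made-up helper name):

```python
def accuracy_from_probs(probs, labels, threshold=0.5):
    """Share of examples where (prob > threshold) matches the 0/1 label."""
    hits = sum(int(p > threshold) == int(y) for p, y in zip(probs, labels))
    return hits / len(labels)

# Toy probabilities standing in for model.predict(X_test).ravel()
toy_probs = [0.91, 0.12, 0.55, 0.40]
toy_labels = [1, 0, 0, 1]
toy_accuracy = accuracy_from_probs(toy_probs, toy_labels)
print('Test accuracy = {:.2%}'.format(toy_accuracy))
```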