# NLP

Classify Russian texts into several categories. It is best if the body of the texts is really large. To pre-process texts: normalization, lemmatization, etc. Compare embeddings. Try several classification methods.

## Import libs and load data

In [81]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/akimg/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/akimg/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [82]:
import pymorphy2
from string import punctuation

def lemmatize(input_text):
    morph = pymorphy2.MorphAnalyzer()
    tokens = nltk.word_tokenize(input_text)
    normed_tokens = [morph.parse(s)[0].normal_form for s in tokens]
    
    # we also exclude stop words - all sorts of prepositions, conjunctions, etc.
    normed_tokens = [word for word in normed_tokens if word not in nltk.corpus.stopwords.words("russian")]
    
    # and punctuation marks
    normed_tokens = [word for word in normed_tokens if word not in punctuation]
    return ' '.join(normed_tokens)

In [83]:
import os
import pandas as pd

# prepare an empty data frame
df = pd.DataFrame(columns=['text', 'class'])

# these are folders in which files with texts
dir0 = "data/Bulgakov/"
dir1 = "data/Soltykov/"

In [84]:
# consider all our texts in a data frame indicating the class
for filename in os.listdir(dir0):
    with open(os.path.join(dir0, filename), encoding='utf8') as file:
        contents = lemmatize(file.read())
    df = df.append(pd.Series({'text': contents, 'class': 0}), ignore_index=True)

In [85]:
# and for the second folder too
for filename in os.listdir(dir1):
    with open(os.path.join(dir1, filename), encoding='utf8') as file:
        contents = lemmatize(file.read())
    df = df.append(pd.Series({'text': contents, 'class': 1}), ignore_index=True)

In [86]:
df

Unnamed: 0,text,class
0,пёс остаться подворотня страдать изуродовать б...,0
1,глянуть погибать вьюга подворотня ревета отход...,0
2,десять минута иван арнольд шарик одетый кепка ...,0
3,касаться внутренний содержание « летописец » о...,1
4,давно иметь намерение написать история какой-н...,1
5,ходить ходить комната сесть посидеть весь дума...,1


In [80]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['class'], test_size=0.12, stratify=df['class'])

ValueError: The test_size = 1 should be greater or equal to the number of classes = 2

## Bag-of-Words-embedding

Of course, mathematical methods are not able to work with clear text. It is time to get embeddings!

In [51]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

bof_vect = CountVectorizer()
bof_vect.fit(np.hstack([X_train, X_test]))
bof_train = bof_vect.transform(X_train)
bof_test = bof_vect.transform(X_test)

In [52]:
bof_train.toarray()

array([[0, 0, 1, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 0, 1],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 1]])

In [53]:
bof_train.toarray().shape

(7, 1138)

## TF-IDF-embedding

In [54]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer()
tfidf_vect.fit(np.hstack([X_train, X_test]))
tfidf_train = tfidf_vect.transform(X_train)
tfidf_test = tfidf_vect.transform(X_test)

In [55]:
tfidf_train.toarray()

array([[0.        , 0.        , 0.03642567, ..., 0.        , 0.        ,
        0.02798311],
       [0.        , 0.        , 0.        , ..., 0.        , 0.07308521,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.04705378],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.08240483, 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.04705378]])

In [56]:
tfidf_train.toarray().shape

(7, 1138)

## Word2vec-embedding

Since w2v is not a sklearn classifier, it will output data of a slightly different type at the output, and this will need to be taken into account in future work.

In [59]:
from gensim.models import Word2Vec

X_train_w2v = X_train.apply(str.split)
X_test_w2v = X_test.apply(str.split)
w2v_vect = Word2Vec(np.hstack([X_train_w2v, X_test_w2v]), size=300, min_count=10, workers=8)

In [60]:
X_train_w2v

5    [десять, минута, иван, арнольд, шарик, одетый,...
0    [пёс, остаться, подворотня, страдать, изуродов...
1    [глянуть, погибать, вьюга, подворотня, ревета,...
7    [давно, иметь, намерение, написать, история, к...
8    [ходить, ходить, комната, сесть, посидеть, вес...
6    [касаться, внутренний, содержание, «, летописе...
4    [глянуть, погибать, вьюга, подворотня, ревета,...
Name: text, dtype: object

You can do various interesting things with Word-Tu-Century. For example, with the following command, we can display the words that turned out to be closest in value to the given word in the training set.

In [64]:
w2v_vect.most_similar(positive="печаль")

  """Entry point for launching an IPython kernel.


KeyError: "word 'печаль' not in vocabulary"

In [65]:
w2v_vect['весь']

  """Entry point for launching an IPython kernel.


array([ 1.10380037e-03,  2.36471431e-04,  3.41607112e-04, -9.93257854e-04,
        1.47163728e-03, -1.62711355e-03,  3.34925280e-04, -4.98512120e-04,
       -1.04274251e-03, -1.61193428e-03,  1.65709842e-03, -8.07390839e-04,
       -1.76269910e-04, -9.08700109e-04,  9.87880281e-04, -1.49801385e-03,
        1.27298699e-03, -6.10415940e-04,  4.64814162e-04,  3.70533089e-04,
        1.30560366e-03, -7.93955754e-04,  5.07335819e-04,  1.02873740e-03,
       -1.87988777e-03, -5.61054156e-04, -6.07786678e-05, -1.11690478e-03,
       -8.27114040e-04, -1.74664092e-04,  1.90636420e-04,  8.21333437e-04,
       -1.69483933e-03, -3.12735472e-04, -1.19259849e-03, -6.27662230e-04,
        3.92277172e-04, -5.39602945e-04, -6.08119357e-04,  1.19828410e-03,
       -4.61073905e-05,  1.16876699e-03, -9.26123350e-04,  1.63124059e-03,
        1.85031653e-03, -1.10266556e-03,  8.96488433e-04,  3.49714479e-04,
       -3.34296812e-04,  4.11453831e-04, -3.87753418e-04, -1.03718473e-03,
        1.69546308e-03, -

Преобразуем тексты песен в вектора - возьмем сумму векторов всех слов, которые входят в песню

In [66]:
def text2vec(text):
    # We average the word vectors
    vecs = []
    for word in text:
        try:
            vecs.append(w2v_vect[word])
        except KeyError:
            pass
    return np.sum(vecs, axis=0) / len(vecs)

w2v_train = X_train_w2v.apply(text2vec)
w2v_test = X_test_w2v.apply(text2vec)
w2v_train

  


5    [-0.00022382694, -0.0004979284, -0.00044563122...
0    [-0.00049684616, 0.0001568521, 0.0003876275, -...
1    [-0.00021404464, 0.00017311411, 0.00051015563,...
7    [-0.00022130257, 0.00016539945, 0.00038072478,...
8    [-1.7317863e-05, 0.00032183586, 0.00016755973,...
6    [-0.00019335303, 0.00023932513, 0.00022692198,...
4    [-0.00021404464, 0.00017311411, 0.00051015563,...
Name: text, dtype: object

In [67]:
w2v_train.shape

(7,)

In [68]:
w2v_train[0]

array([-4.9684616e-04,  1.5685210e-04,  3.8762749e-04, -8.8967872e-04,
        3.5284198e-04, -9.0264285e-04,  3.4743326e-04, -4.7636739e-04,
        3.3773955e-05, -7.1341044e-04, -3.8116064e-04, -4.0370508e-04,
       -4.0692033e-04, -1.0134970e-04,  1.8529981e-05, -2.3438803e-04,
        4.8643170e-04,  3.4365899e-04,  1.7414441e-04, -2.6375137e-04,
        8.9235511e-04, -5.8396236e-04,  2.3565056e-04,  4.3559555e-04,
       -7.9668465e-04,  6.5353973e-04, -1.9244240e-04,  4.4904248e-04,
       -1.1846103e-03,  3.9504630e-05, -1.8168146e-05, -4.2719554e-04,
       -2.3957086e-04,  9.7934761e-05,  4.2912591e-04, -5.4771744e-04,
        4.1007119e-05,  6.7932218e-05,  2.7127250e-04,  3.2135338e-04,
       -2.2295325e-04,  9.4294612e-04,  2.2910390e-04,  2.8152743e-04,
        9.1577508e-04, -4.2839831e-04,  4.5135818e-04, -7.0421083e-04,
        7.8954530e-04,  5.8503199e-04, -1.4441827e-04, -2.7366559e-04,
        1.8802746e-04,  8.8389107e-04,  2.2675243e-04, -5.8865413e-04,
      

In [69]:
w2v_train = np.dstack(w2v_train)[0]
w2v_train.shape

(300, 7)

In [70]:
w2v_test = np.dstack(w2v_test)[0]

## Text classification

Now we have a classic feature description of each text. We can train classifiers or come up with some other metric.

Let's calculate for each embedding two total vectors - for the texts of Philip and for the texts of Aria. For Bag of Words:

In [71]:
kirk_mean_bof = np.sum(bof_train[y_train == 0], axis=0)
kirk_mean_bof.shape

ValueError: provided out is the wrong size for the reduction

In [72]:
aria_mean_bof = np.sum(bof_train[y_train == 1], axis=0)

ValueError: provided out is the wrong size for the reduction

Для TF-IDF

In [73]:
kirk_mean_tfidf = np.sum(tfidf_train[y_train == 0], axis=0)
kirk_mean_tfidf.shape

ValueError: provided out is the wrong size for the reduction

Для Word2Vec

In [None]:
kirk_mean_w2v = np.sum(w2v_train[:, y_train == 0], axis=1)
kirk_mean_w2v.shape

In [None]:
aria_mean_w2v = np.sum(w2v_train[:, y_train == 1], axis=1)

Посмотрим на их вид:

In [None]:
kirk_mean_bof

In [None]:
kirk_mean_tfidf

In [None]:
kirk_mean_w2v

And now let's build data frames with the results of the classification of test texts. We assume that the text refers to the performer with whom its cosine distance is greater.

In [None]:
from scipy.spatial.distance import cosine
bof_kirk = np.apply_along_axis(cosine, 1, bof_test.toarray(), v=kirk_mean_bof)
bof_aria = np.apply_along_axis(cosine, 1, bof_test.toarray(), v=aria_mean_bof)

bof_results = pd.DataFrame([
    bof_kirk,
    bof_aria,
    np.maximum(bof_kirk, bof_aria) == bof_aria,
    y_test
], index=["kirk", "aria", "predict", "class"]).T.astype(np.float)
bof_results

And calculate accuracy for predictions

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(bof_results['predict'], bof_results['class'])

In [None]:
tfidf_kirk = np.apply_along_axis(cosine, 1, tfidf_test.toarray(), v=kirk_mean_tfidf)
tfidf_aria = np.apply_along_axis(cosine, 1, tfidf_test.toarray(), v=aria_mean_tfidf)

tfidf_results = pd.DataFrame([
    tfidf_kirk,
    tfidf_aria,
    np.maximum(tfidf_kirk, tfidf_aria) == tfidf_aria,
    y_test
], index=["kirk", "aria", "predict", "class"]).T.astype(np.float)
tfidf_results

In [None]:
accuracy_score(tfidf_results['predict'], tfidf_results['class'])

In [None]:
w2v_kirk = np.apply_along_axis(cosine, 0, w2v_test, v=kirk_mean_w2v)
w2v_aria = np.apply_along_axis(cosine, 0, w2v_test, v=aria_mean_w2v)

w2v_results = pd.DataFrame([
    w2v_kirk,
    w2v_aria,
    np.maximum(w2v_kirk, w2v_aria) == w2v_aria,
    y_test
], index=["kirk", "aria", "predict", "class"]).T.astype(np.float)
w2v_results

In [None]:
accuracy_score(w2v_results['predict'], w2v_results['class'])

We see that simple embeddings solve this problem with such methods better, and w2v gives a big error. Why do you think so?

Finally, let’s try to apply some classic machine learning model, such as a random forest, on top of embeddings.


In [None]:
from sklearn.ensemble import RandomForestClassifier
RandomForestClassifier().fit(bof_train.toarray(), y_train.tolist()).score(bof_test.toarray(), y_test.tolist())

In [None]:
RandomForestClassifier().fit(tfidf_train.toarray(), y_train.tolist()).score(tfidf_test.toarray(), y_test.tolist())

In [None]:
RandomForestClassifier().fit(w2v_train.T, y_train.tolist()).score(w2v_test.T, y_test.tolist())

Quality is better