# Preprocessing for training

## Description:
This notebook describes how the raw dataset was cleaned by applying *tokenization*, *lemmatization*, and *stopword elimination*.
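
The actual implementation lives in `src.nlp.preprocessing.clean` and is not shown in this notebook. As a minimal sketch of the tokenization and stopword-elimination steps only, assuming a plain regular-expression tokenizer in place of `razdel` or `rutokenizer` and a tiny hand-written stopword set, the idea looks roughly like this (`clean_sketch` is a hypothetical name, and lemmatization is omitted):

```python
import re
from typing import Iterable, List


def clean_sketch(text: str, stopwords: Iterable[str]) -> List[str]:
    """Hypothetical stand-in for `clean`: lowercase, tokenize, drop stopwords.

    Assumption: a regex tokenizer replaces the real tokenizers used in the
    notebook, and no lemmatization is performed.
    """
    stopword_set = set(stopwords)  # set membership checks are O(1)
    # \w+ matches runs of letters, digits and underscores, including Cyrillic.
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stopword_set]


tokens = clean_sketch("«Зеленую милю» я смотрела два раза", {"я"})
print(tokens)  # ['зеленую', 'милю', 'смотрела', 'два', 'раза']
```

Converting the stopwords to a set once per call keeps the filtering step cheap even for long stopword lists.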

`pandarallel` is used to parallelize the computations on a 6-core CPU (with 12 workers, as set in the initialization cell).
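
As a rough illustration of what this parallelization implies for the dataset, the sketch below splits 90646 rows (the dataset size reported later in the notebook) across 12 workers (the `nb_workers` value used in the import cell). The helper `split_evenly` is hypothetical and only assumes that rows are divided into contiguous, near-equal chunks; `pandarallel`'s internal chunking may differ in detail.

```python
from typing import List


def split_evenly(n_rows: int, n_workers: int) -> List[range]:
    """Split row positions 0..n_rows-1 into n_workers contiguous chunks
    whose sizes differ by at most one (hypothetical helper)."""
    base, extra = divmod(n_rows, n_workers)
    chunks = []
    start = 0
    for worker in range(n_workers):
        # The first `extra` workers each take one additional row.
        size = base + (1 if worker < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


chunks = split_evenly(90646, 12)
sizes = [len(chunk) for chunk in chunks]
print(max(sizes), min(sizes))  # 7554 7553
```

Under this assumption each worker handles about 7554 reviews, so the wall time of a run is roughly the time needed to clean that many reviews sequentially.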

## Contents:
* imports and dataset initialization
* \[`razdel`, `TreebankWordTokenizer`, `rutokenizer`] $\times$
    [`nltk.~.words('russian')`, `spacy.~.STOP_WORDS`, custom stopword file `stop-ru.txt`]
* \[`razdel`, `rutokenizer`] $\times$ [`[]`] (no stopwords)
* An attempt to use `rulemma` together with `rupostagger` (required by `rulemma`); parallelization was not successful.
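
The two grids above can be enumerated programmatically. The sketch below uses only the tokenizer names, the stopword labels, and the output filename pattern `reviews_Review_Label_{tokenizer}_{stopwords}.df` that appear in the code cells of this notebook; the helper name `output_names` is hypothetical.

```python
from itertools import product
from typing import Iterable, List


def output_names(tokenizers: Iterable[str], stopword_labels: Iterable[str]) -> List[str]:
    """Build one output filename per (tokenizer, stopword label) pair."""
    return [
        f"reviews_Review_Label_{tokenizer}_{label}.df"
        for tokenizer, label in product(tokenizers, stopword_labels)
    ]


# Grid 1: three tokenizers x three stopword sources.
with_stopwords = output_names(
    ["razdel", "TreebankWordTokenizer", "rutokenizer"],
    ["nltk", "spacy", "third_party_nltk"],
)
# Grid 2: two tokenizers x no stopwords.
without_stopwords = output_names(["razdel", "rutokenizer"], ["no"])

all_names = with_stopwords + without_stopwords
print(len(all_names))  # 11
```

`itertools.product` iterates the rightmost argument fastest, which matches the order of the nested loops used in the cells below.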

In [1]:
%cd ../..

import dill
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True, nb_workers=12)

from src.nlp.preprocessing import clean
from datasets.getters import load_reviews_Review_Label

from nltk.corpus import stopwords
from spacy.lang.ru import stop_words


C:\Users\Yaroslav Pristalov\Documents\Programming\nlp-coursework
INFO: Pandarallel will run on 12 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.

https://nalepae.github.io/pandarallel/troubleshooting/


In [2]:
data = load_reviews_Review_Label()
data

Unnamed: 0,review,label
0,«Зеленую милю» я смотрела два раза: 10 лет наз...,NEUTRAL
1,Период конца девяностых годов-начало двухтысяч...,POSITIVE
2,"Очень сложно писать рецензию на этот фильм, та...",POSITIVE
3,Любимая многими миллионами ценителями киноиску...,POSITIVE
4,В нашем мире существует много разных фильмов. ...,POSITIVE
...,...,...
90641,"Конечно, этот фильм - не лучший представитель ...",POSITIVE
90642,Фильм «Ламборгини: Человек-легенда» снят в 202...,NEGATIVE
90643,"Эй, рагацци, вы это серьёзно, ТАК показывать и...",NEGATIVE
90644,"Вообще, говоря о байопиках, стоит отметить, чт...",NEGATIVE


In [3]:
# Load the third-party stopword list, one word per line, skipping blank lines.
with open('stop-ru.txt', 'rt', encoding='UTF-8') as sw:
    _file_stop_words = [line.strip() for line in sw if line.strip()]

# Stopword sources keyed by the label used in the output file names.
stopword_sets = {
    'nltk': stopwords.words('russian'),
    'spacy': stop_words.STOP_WORDS,
    'third_party_nltk': _file_stop_words,
}

for _tokenizer in ['razdel', 'TreebankWordTokenizer', 'rutokenizer']:
    for sw_name, _stopwords in stopword_sets.items():
        print(f'{_tokenizer=}, stopwords={sw_name}')
        # Reload the raw data so each configuration starts from uncleaned text.
        data = load_reviews_Review_Label()
        print(f'{data.shape=}')

        %time data['review'] = data['review'].parallel_apply(clean, args=(_tokenizer, _stopwords))

        with open(f'reviews_Review_Label_{_tokenizer}_{sw_name}.df', 'wb') as file:
            dill.dump(data, file)
            print(f'data was saved to {file.name}\n\n')

_tokenizer='razdel', stopwords=nltk
data.shape=(90646, 2)


Wall time: 1h 1min 35s
data was saved to reviews_Review_Label_razdel_nltk.df


_tokenizer='razdel', stopwords=spacy
data.shape=(90646, 2)


Wall time: 1h 1min 43s
data was saved to reviews_Review_Label_razdel_spacy.df


_tokenizer='razdel', stopwords=third_party_nltk
data.shape=(90646, 2)


Wall time: 1h 2min 48s
data was saved to reviews_Review_Label_razdel_third_party_nltk.df


_tokenizer='TreebankWordTokenizer', stopwords=nltk
data.shape=(90646, 2)


Wall time: 1h 2min 15s
data was saved to reviews_Review_Label_TreebankWordTokenizer_nltk.df


_tokenizer='TreebankWordTokenizer', stopwords=spacy
data.shape=(90646, 2)


Wall time: 1h 2min 51s
data was saved to reviews_Review_Label_TreebankWordTokenizer_spacy.df


_tokenizer='TreebankWordTokenizer', stopwords=third_party_nltk
data.shape=(90646, 2)


Wall time: 1h 1min 21s
data was saved to reviews_Review_Label_TreebankWordTokenizer_third_party_nltk.df


_tokenizer='rutokenizer', stopwords=nltk
data.shape=(90646, 2)


Wall time: 1h 4min 18s
data was saved to reviews_Review_Label_rutokenizer_nltk.df


_tokenizer='rutokenizer', stopwords=spacy
data.shape=(90646, 2)


Wall time: 1h 3min 47s
data was saved to reviews_Review_Label_rutokenizer_spacy.df


_tokenizer='rutokenizer', stopwords=third_party_nltk
data.shape=(90646, 2)


Wall time: 1h 6min 27s
data was saved to reviews_Review_Label_rutokenizer_third_party_nltk.df




In [3]:
# Same pipeline without stopword removal: an empty stopword list is passed to `clean`.
_stopwords = []

for _tokenizer in ['razdel', 'rutokenizer']:
    print(f'{_tokenizer=}, stopwords=no')
    # Reload the raw data so each configuration starts from uncleaned text.
    data = load_reviews_Review_Label()
    print(f'{data.shape=}')

    %time data['review'] = data['review'].parallel_apply(clean, args=(_tokenizer, _stopwords))

    with open(f'reviews_Review_Label_{_tokenizer}_no.df', 'wb') as file:
        dill.dump(data, file)
        print(f'data was saved to {file.name}\n\n')

_tokenizer='razdel', stopwords=no
data.shape=(90646, 2)
<function clean at 0x000002685B247C10> razdel []


Wall time: 1h 11min 7s
data was saved to reviews_Review_Label_razdel_no.df


_tokenizer='rutokenizer', stopwords=no
data.shape=(90646, 2)
<function clean at 0x000002685B247C10> rutokenizer []


Wall time: 1h 11min 21s
data was saved to reviews_Review_Label_rutokenizer_no.df




In [None]:
# Unsuccessful: parallelizing this step would require copying the loaded rulemma and
# rupostagger databases to every worker, or reloading them on each iteration,
# and neither is practical.

import rulemma
import rupostagger

# Load the third-party stopword list, one word per line, skipping blank lines.
with open('stop-ru.txt', 'rt', encoding='UTF-8') as sw:
    _file_stop_words = [line.strip() for line in sw if line.strip()]

lemmatizer = rulemma.Lemmatizer()
lemmatizer.load()

tagger = rupostagger.RuPosTagger()
tagger.load()

# Stopword sources keyed by the label used in the output file names.
stopword_sets = {
    'nltk': stopwords.words('russian'),
    'spacy': stop_words.STOP_WORDS,
    'third_party_nltk': _file_stop_words,
}

for _tokenizer in ['razdel', 'TreebankWordTokenizer', 'rutokenizer']:
    for sw_name, _stopwords in stopword_sets.items():
        print(f'{_tokenizer=}, stopwords={sw_name}, lemmatizer=rulemma')
        data = load_reviews_Review_Label()
        print(f'{data.shape=}')

        # Sequential smoke test on the first 10 reviews only. The 10-row result is
        # assigned back by index, so the remaining rows of `review` become NaN.
        %time data['review'] = data['review'][:10].apply(clean, args=(_tokenizer, _stopwords, 'rulemma', tagger, lemmatizer))

        with open(f'reviews_Review_Label_{_tokenizer}_rulemma_{sw_name}.df', 'wb') as file:
            dill.dump(data, file)
            print(f'data was saved to {file.name}\n\n')

_tokenizer='razdel', stopwords=nltk, lemmatizer=rulemma
data.shape=(90646, 2)
Wall time: 1.15 s
data was saved to reviews_Review_Label_razdel_rulemma_nltk.df


_tokenizer='razdel', stopwords=spacy, lemmatizer=rulemma
data.shape=(90646, 2)
Wall time: 1.01 s
data was saved to reviews_Review_Label_razdel_rulemma_spacy.df


_tokenizer='razdel', stopwords=third_party_nltk, lemmatizer=rulemma
data.shape=(90646, 2)
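
One pattern that might address the reloading half of the problem noted in the cell above is to load the heavy models lazily, once per process, instead of passing them as arguments. This was not tried in this notebook; the sketch below is only an illustration under that assumption. `FakeLemmatizer` is a hypothetical stand-in for a model such as `rulemma.Lemmatizer`, and `LOAD_LOG` exists only to show how many times loading happens.

```python
from functools import lru_cache
from typing import List

LOAD_LOG: List[str] = []  # records every (expensive) model load, for illustration


class FakeLemmatizer:
    """Hypothetical stand-in for a heavy model with a slow loading step."""

    def __init__(self) -> None:
        LOAD_LOG.append("loaded")  # a real model would read its databases here

    def lemma(self, token: str) -> str:
        return token.lower()  # placeholder logic, not real lemmatization


@lru_cache(maxsize=None)
def get_lemmatizer() -> FakeLemmatizer:
    """Create the model on first use and reuse it for later calls in this process."""
    return FakeLemmatizer()


def lemmatize(text: str) -> List[str]:
    lemmatizer = get_lemmatizer()  # cached after the first call
    return [lemmatizer.lemma(token) for token in text.split()]


results = [lemmatize(text) for text in ["Хороший фильм", "Плохой ФИЛЬМ"]]
print(len(LOAD_LOG))  # 1
```

With this pattern each worker process would load the model at most once, on its first row, rather than once per row. Every worker would still hold its own copy in memory, so the memory cost of 12 copies remains.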
