# Getting data for pair classification

According to [this article](https://arxiv.org/pdf/2202.01924.pdf) on zero-shot learning, the input data needs to be in the following format:

$I_i = [CLS] + P_i + [SEP] + H_i + [SEP]$

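where $P_i$ is a premise and $H_i$ is a hypothesis. In this notebook the premise is the sentence that contains an aspect mention and the hypothesis is the mention itself; the sentiment or the category serves as the class label of each sample.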
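As a minimal sketch of this format, the helper below joins a premise and a hypothesis with explicit special tokens. The name `build_pair` is hypothetical and not part of the original notebook; the defaults are the BERT tokens, and the space-separated layout is an assumption that mirrors how the pairs are assembled in `get_data` further down.

```python
def build_pair(premise: str, hypothesis: str,
               cls: str = '[CLS]', sep: str = '[SEP]') -> str:
    """Return 'cls premise sep hypothesis sep' as a single string."""
    return f'{cls} {premise} {sep} {hypothesis} {sep}'


# Example with a sentence and an aspect mention taken from the restaurant data.
pair = build_pair('Отличный ресторан !', 'ресторан')
print(pair)
```

Passing `cls='<s>'` and `sep='</s>'` would produce the XLM-RoBERTa variant of the same pair.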

## Imports

In [None]:
!pip3 install stanza


In [None]:
from collections import defaultdict
from google.colab import drive
import logging
import random
import os
import pandas as pd
import numpy as np

import stanza

stanza.download('ru')



In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
def seed_everything(seed: int = 42) -> None:
    '''
    Seed Python's random module, PYTHONHASHSEED and NumPy for reproducibility.
    '''
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)

In [None]:
seed_everything()

In [None]:
SENTIMENT = ['neutral', 'positive', 'negative']
RESTAURANT_CATEGORIES = ['Whole', 'Interior', 'Service', 'Food', 'Price']
AUTOMOBILE_CATEGORIES = ['Comfort', 'Appearance', 'Reliability', 'Safety', 'Driveability', 'Whole', 'Costs']
CATEGORIES = ['Whole', 'Interior', 'Service', 'Food', 'Price', 'Comfort', 'Appearance', 'Reliability', 'Safety', 'Driveability', 'Costs']

In [None]:
SPECIAL_TOKENS = {
    'bert': {
        'cls': '[CLS]',
        'sep': '[SEP]'
    },
    'xlm': {
        'cls': '<s>',
        'sep': '</s>'
    }
}

## Pair formation

In [None]:
train_restaurants_reviews = pd.read_csv('/content/drive/MyDrive/Summarization/restaurant data/train_split_reviews.txt', delimiter='\t', names=['text_id', 'text'])
train_automobiles_reviews = pd.read_csv('/content/drive/MyDrive/Summarization/automobile data/train_split_reviews.txt', delimiter='\t', names=['text_id', 'text'])

In [None]:
dev_restaurants_reviews = pd.read_csv('/content/drive/MyDrive/Summarization/restaurant data/dev_reviews.txt', delimiter='\t', names=['text_id', 'text'])
dev_automobiles_reviews = pd.read_csv('/content/drive/MyDrive/Summarization/automobile data/dev_reviews.txt', delimiter='\t', names=['text_id', 'text'])

In [None]:
print(len(dev_restaurants_reviews))
print(len(dev_automobiles_reviews))

101
105


In [None]:
train_restaurants_aspects = pd.read_csv('/content/drive/MyDrive/Summarization/restaurant data/train_split_aspects.txt', delimiter='\t', names=['text_id', 'category', 'mention', 'start', 'end', 'sentiment'])
train_automobiles_aspects = pd.read_csv('/content/drive/MyDrive/Summarization/automobile data/train_split_aspects.txt', delimiter='\t', names=['text_id', 'category', 'mention', 'start', 'end', 'sentiment'])

In [None]:
dev_restaurants_aspects = pd.read_csv('/content/drive/MyDrive/Summarization/restaurant data/dev_aspects.txt', delimiter='\t', names=['text_id', 'category', 'mention', 'start', 'end', 'sentiment'])
dev_automobiles_aspects = pd.read_csv('/content/drive/MyDrive/Summarization/automobile data/dev_aspects.txt', delimiter='\t', names=['text_id', 'category', 'mention', 'start', 'end', 'sentiment'])

In [None]:
nlp = stanza.Pipeline('ru', processors='tokenize')

INFO:stanza:Loading these models for language: ru (Russian):
| Processor | Package   |
-------------------------
| tokenize  | syntagrus |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Done loading processors!


In [None]:
def get_data(reviews: pd.DataFrame, aspects: pd.DataFrame, categories: list, model: str) -> tuple:
    '''
    Get data for pair classification with stanza.
    '''
    spec_tokens = SPECIAL_TOKENS.get(model, None)
    if not spec_tokens:
        raise ValueError('Provide model name to get right special tokens!')

    sep = spec_tokens.get('sep', None)
    cls = spec_tokens.get('cls', None)

    review_ids = []
    data = []
    category_ids = []
    sentiment_ids = []

    reviews_sentences = defaultdict(list)

    logging.warning('Start getting data...')

    for rev_idx, rev in reviews.iterrows():
        text_id = int(rev['text_id'])
        text = rev['text']

        logging.warning('Text ID: %s' % text_id)

        # stanza processing to parse sentences and start and end characters
        doc = nlp(text)
        logging.warning('Processed by stanza')
        sents_with_end = {}
        for sent in doc.sentences:
            # end offset of the last token in the current sentence
            end_idx = sent.tokens[-1].end_char
            sentence = sent.text
            sents_with_end[end_idx] = sentence
            reviews_sentences[text_id].append(sentence)

        logging.warning('Got sentences and their ends')

        # catch needed sentence
        rev_aspects = aspects[aspects['text_id'] == text_id]
        logging.warning('Got aspects for the current review')

        for asp_idx, asp in rev_aspects.iterrows():
            mention_end_char = asp['end']

            for end_char in sents_with_end:
                if end_char >= mention_end_char:
                    sentence = sents_with_end[end_char]  # first sentence where end > end of mention
                    mention = asp['mention']
                    data.append(f'{cls} {sentence} {sep} {mention} {sep}')

                    category_ids.append(categories.index(asp['category']))
                    sentiment_ids.append(SENTIMENT.index(asp['sentiment']))

                    review_ids.append(text_id)

                    break  # just one sentence

        logging.warning('Got data for the current review')

    return review_ids, data, category_ids, sentiment_ids, reviews_sentences
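
The matching step inside `get_data` picks the first sentence whose end offset is at or past the end of the aspect mention. Below is a minimal standalone sketch of that rule. The helper name `find_sentence` and the offsets are made up for illustration; the sketch uses binary search instead of the notebook's linear scan and assumes the sentence end offsets are sorted in ascending order.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


def find_sentence(sentences: List[Tuple[int, str]],
                  mention_end: int) -> Optional[str]:
    """Return the first sentence whose end offset is >= mention_end.

    `sentences` holds (end_char, text) pairs sorted by end_char.
    Returns None when the mention ends after the last sentence.
    """
    ends = [end for end, _ in sentences]
    pos = bisect_left(ends, mention_end)  # first index with ends[pos] >= mention_end
    return sentences[pos][1] if pos < len(sentences) else None


# Hypothetical offsets: three sentences ending at characters 10, 25 and 40.
toy_sentences = [(10, 'first'), (25, 'second'), (40, 'third')]
print(find_sentence(toy_sentences, 10))
print(find_sentence(toy_sentences, 11))
print(find_sentence(toy_sentences, 41))
```

A mention that ends beyond the last sentence yields `None` here; in `get_data` the same situation silently produces no sample for that aspect.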

### Restaurants

In [None]:
eval_reviews = dev_restaurants_reviews[:50]
test_reviews = dev_restaurants_reviews[50:]

#### BERT

In [None]:
train_review_ids, train_data, train_category_ids, train_sentiment_ids, train_reviews_sentences = get_data(
    train_restaurants_reviews, train_restaurants_aspects,
    RESTAURANT_CATEGORIES, 'bert')
eval_review_ids, eval_data, eval_category_ids, eval_sentiment_ids, eval_reviews_sentences = get_data(
    eval_reviews, dev_restaurants_aspects,
    RESTAURANT_CATEGORIES, 'bert')
test_review_ids, test_data, test_category_ids, test_sentiment_ids, test_reviews_sentences = get_data(
    test_reviews, dev_restaurants_aspects,
    RESTAURANT_CATEGORIES, 'bert')



In [None]:
train_restaurants_bert = pd.DataFrame(
    {
        'text_id': train_review_ids,
        'text': train_data,
        'category': train_category_ids,
        'sentiment': train_sentiment_ids
    }
)
train_restaurants_bert.to_csv('/content/drive/MyDrive/Summarization/sentiment analysis/restaurants/train_bert_dataset.csv', sep='\t', index=False)

eval_restaurants_bert = pd.DataFrame(
    {
        'text_id': eval_review_ids,
        'text': eval_data,
        'category': eval_category_ids,
        'sentiment': eval_sentiment_ids
    }
)
eval_restaurants_bert.to_csv('/content/drive/MyDrive/Summarization/sentiment analysis/restaurants/eval_bert_dataset.csv', sep='\t', index=False)

test_restaurants_bert = pd.DataFrame(
    {
        'text_id': test_review_ids,
        'text': test_data,
        'category': test_category_ids,
        'sentiment': test_sentiment_ids
    }
)
test_restaurants_bert.to_csv('/content/drive/MyDrive/Summarization/sentiment analysis/restaurants/test_bert_dataset.csv', sep='\t', index=False)

In [None]:
test_restaurants_bert['text'].values

array(['[CLS] За пару дней заказал столик , что оказалось весьма кстати-несмотря на будний день , веранда была занята полностью . [SEP] заказал столик [SEP]',
       '[CLS] При входе улыбчивая девушка хостесс встретила и проводила к столику - приятно . [SEP] девушка хостесс [SEP]',
       '[CLS] При входе улыбчивая девушка хостесс встретила и проводила к столику - приятно . [SEP] встретила [SEP]',
       ..., '[CLS] Отличный ресторан ! [SEP] ресторан [SEP]',
       '[CLS] Мальчикам официантам - большое СПАСИБО за хорошую работу ! ! ! [SEP] официантам [SEP]',
       '[CLS] P. S. Советую всем Две палочки ) ) [SEP] Две палочки [SEP]'],
      dtype=object)

#### XLM-RoBERTa

In [None]:
logging.warning('TRAIN')
train_review_ids, train_data, train_category_ids, train_sentiment_ids, train_reviews_sentences = get_data(
    train_restaurants_reviews, train_restaurants_aspects,
    RESTAURANT_CATEGORIES, 'xlm')

logging.warning('EVAL')
eval_review_ids, eval_data, eval_category_ids, eval_sentiment_ids, eval_reviews_sentences = get_data(
    eval_reviews, dev_restaurants_aspects,
    RESTAURANT_CATEGORIES, 'xlm')

logging.warning('TEST')
test_review_ids, test_data, test_category_ids, test_sentiment_ids, test_reviews_sentences = get_data(
    test_reviews, dev_restaurants_aspects,
    RESTAURANT_CATEGORIES, 'xlm')



In [None]:
train_restaurants_xlmroberta = pd.DataFrame(
    {
        'text_id': train_review_ids,
        'text': train_data,
        'category': train_category_ids,
        'sentiment': train_sentiment_ids
    }
)
train_restaurants_xlmroberta.to_csv('/content/drive/MyDrive/Summarization/sentiment analysis/restaurants/train_xlmroberta_dataset.csv', sep='\t', index=False)

eval_restaurants_xlmroberta = pd.DataFrame(
    {
        'text_id': eval_review_ids,
        'text': eval_data,
        'category': eval_category_ids,
        'sentiment': eval_sentiment_ids
    }
)
eval_restaurants_xlmroberta.to_csv('/content/drive/MyDrive/Summarization/sentiment analysis/restaurants/eval_xlmroberta_dataset.csv', sep='\t', index=False)

test_restaurants_xlmroberta = pd.DataFrame(
    {
        'text_id': test_review_ids,
        'text': test_data,
        'category': test_category_ids,
        'sentiment': test_sentiment_ids
    }
)
test_restaurants_xlmroberta.to_csv('/content/drive/MyDrive/Summarization/sentiment analysis/restaurants/test_xlmroberta_dataset.csv', sep='\t', index=False)

In [None]:
test_restaurants_xlmroberta.values

array([[13100,
        '<s> За пару дней заказал столик , что оказалось весьма кстати-несмотря на будний день , веранда была занята полностью . </s> заказал столик </s>',
        2, 0],
       [13100,
        '<s> При входе улыбчивая девушка хостесс встретила и проводила к столику - приятно . </s> девушка хостесс </s>',
        2, 1],
       [13100,
        '<s> При входе улыбчивая девушка хостесс встретила и проводила к столику - приятно . </s> встретила </s>',
        2, 1],
       ...,
       [4658, '<s> Отличный ресторан ! </s> ресторан </s>', 0, 1],
       [4658,
        '<s> Мальчикам официантам - большое СПАСИБО за хорошую работу ! ! ! </s> официантам </s>',
        2, 1],
       [4658,
        '<s> P. S. Советую всем Две палочки ) ) </s> Две палочки </s>',
        0, 1]], dtype=object)

### Automobiles

In [None]:
eval_reviews = dev_automobiles_reviews[:50]
test_reviews = dev_automobiles_reviews[50:]

#### BERT

In [None]:
logging.warning('TRAIN')
train_review_ids, train_data, train_category_ids, train_sentiment_ids, train_reviews_sentences = get_data(
    train_automobiles_reviews, train_automobiles_aspects,
    AUTOMOBILE_CATEGORIES, 'bert')

logging.warning('EVAL')
eval_review_ids, eval_data, eval_category_ids, eval_sentiment_ids, eval_reviews_sentences = get_data(
    eval_reviews, dev_automobiles_aspects,
    AUTOMOBILE_CATEGORIES, 'bert')

logging.warning('TEST')
test_review_ids, test_data, test_category_ids, test_sentiment_ids, test_reviews_sentences = get_data(
    test_reviews, dev_automobiles_aspects,
    AUTOMOBILE_CATEGORIES, 'bert')



In [None]:
train_automobiles_bert = pd.DataFrame(
    {
        'text_id': train_review_ids,
        'text': train_data,
        'category': train_category_ids,
        'sentiment': train_sentiment_ids
    }
)
train_automobiles_bert.to_csv('/content/drive/MyDrive/Summarization/sentiment analysis/automobiles/train_bert_dataset.csv', sep='\t', index=False)

eval_automobiles_bert = pd.DataFrame(
    {
        'text_id': eval_review_ids,
        'text': eval_data,
        'category': eval_category_ids,
        'sentiment': eval_sentiment_ids
    }
)
eval_automobiles_bert.to_csv('/content/drive/MyDrive/Summarization/sentiment analysis/automobiles/eval_bert_dataset.csv', sep='\t', index=False)

test_automobiles_bert = pd.DataFrame(
    {
        'text_id': test_review_ids,
        'text': test_data,
        'category': test_category_ids,
        'sentiment': test_sentiment_ids
    }
)
test_automobiles_bert.to_csv('/content/drive/MyDrive/Summarization/sentiment analysis/automobiles/test_bert_dataset.csv', sep='\t', index=False)

In [None]:
test_automobiles_bert.head()

| | text_id | text | category | sentiment |
|---|---|---|---|---|
| 0 | 92845 | `[CLS] Недавно купил этот автомобиль . [SEP] ав...` | 5 | 0 |
| 1 | 92845 | `[CLS] Авто отличное ! [SEP] Авто [SEP]` | 5 | 1 |
| 2 | 92845 | `[CLS] Двигатель 2,5 литра , турбодизель . [SEP...` | 4 | 0 |
| 3 | 92845 | `[CLS] Двигатель 2,5 литра , турбодизель . [SEP...` | 4 | 0 |
| 4 | 92845 | `[CLS] Прежний хозяин сказал при продаже , что ...` | 4 | 1 |


#### XLM-RoBERTa

In [None]:
logging.warning('TRAIN')
train_review_ids, train_data, train_category_ids, train_sentiment_ids, train_reviews_sentences = get_data(
    train_automobiles_reviews, train_automobiles_aspects,
    AUTOMOBILE_CATEGORIES, 'xlm')

logging.warning('EVAL')
eval_review_ids, eval_data, eval_category_ids, eval_sentiment_ids, eval_reviews_sentences = get_data(
    eval_reviews, dev_automobiles_aspects,
    AUTOMOBILE_CATEGORIES, 'xlm')

logging.warning('TEST')
test_review_ids, test_data, test_category_ids, test_sentiment_ids, test_reviews_sentences = get_data(
    test_reviews, dev_automobiles_aspects,
    AUTOMOBILE_CATEGORIES, 'xlm')



In [None]:
train_automobiles_xlmroberta = pd.DataFrame(
    {
        'text_id': train_review_ids,
        'text': train_data,
        'category': train_category_ids,
        'sentiment': train_sentiment_ids
    }
)
train_automobiles_xlmroberta.to_csv('/content/drive/MyDrive/Summarization/sentiment analysis/automobiles/train_xlmroberta_dataset.csv', sep='\t', index=False)

eval_automobiles_xlmroberta = pd.DataFrame(
    {
        'text_id': eval_review_ids,
        'text': eval_data,
        'category': eval_category_ids,
        'sentiment': eval_sentiment_ids
    }
)
eval_automobiles_xlmroberta.to_csv('/content/drive/MyDrive/Summarization/sentiment analysis/automobiles/eval_xlmroberta_dataset.csv', sep='\t', index=False)

test_automobiles_xlmroberta = pd.DataFrame(
    {
        'text_id': test_review_ids,
        'text': test_data,
        'category': test_category_ids,
        'sentiment': test_sentiment_ids
    }
)
test_automobiles_xlmroberta.to_csv('/content/drive/MyDrive/Summarization/sentiment analysis/automobiles/test_xlmroberta_dataset.csv', sep='\t', index=False)

In [None]:
test_automobiles_xlmroberta.head()

| | text_id | text | category | sentiment |
|---|---|---|---|---|
| 0 | 92845 | `<s> Недавно купил этот автомобиль . </s> автом...` | 5 | 0 |
| 1 | 92845 | `<s> Авто отличное ! </s> Авто </s>` | 5 | 1 |
| 2 | 92845 | `<s> Двигатель 2,5 литра , турбодизель . </s> Д...` | 4 | 0 |
| 3 | 92845 | `<s> Двигатель 2,5 литра , турбодизель . </s> т...` | 4 | 0 |
| 4 | 92845 | `<s> Прежний хозяин сказал при продаже , что ма...` | 4 | 1 |


### Both

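The category indices stored so far are positions in each dataset's own label list (`RESTAURANT_CATEGORIES` or `AUTOMOBILE_CATEGORIES`), so the same number means different categories in the two datasets. Before merging, we remap them to positions in the shared `CATEGORIES` list.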

In [None]:
def get_new_category_index(labels: list, idx: int, new_labels=CATEGORIES) -> int:
    '''
    Get category index for both datasets.
    '''
    label = labels[idx]
    new_idx = new_labels.index(label)

    return new_idx

In [None]:
def reindex_category(df: pd.DataFrame, labels: list) -> pd.DataFrame:
    '''
    Reindex category in the dataframe.
    '''
    new_df = df.copy()
    new_df['category'] = new_df['category'].apply(lambda x: get_new_category_index(labels, x))

    return new_df
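
To make the effect of the remapping concrete, the sketch below builds the full old-to-new index mapping for each dataset. The label lists are copied from the constants defined earlier; the helper name `build_index_map` is hypothetical and only illustrates what `get_new_category_index` computes for one index at a time.

```python
RESTAURANT_CATEGORIES = ['Whole', 'Interior', 'Service', 'Food', 'Price']
AUTOMOBILE_CATEGORIES = ['Comfort', 'Appearance', 'Reliability', 'Safety',
                         'Driveability', 'Whole', 'Costs']
CATEGORIES = ['Whole', 'Interior', 'Service', 'Food', 'Price', 'Comfort',
              'Appearance', 'Reliability', 'Safety', 'Driveability', 'Costs']


def build_index_map(labels: list, new_labels: list = CATEGORIES) -> dict:
    """Map each dataset-local category index to its index in the shared list."""
    return {idx: new_labels.index(label) for idx, label in enumerate(labels)}


restaurant_map = build_index_map(RESTAURANT_CATEGORIES)
automobile_map = build_index_map(AUTOMOBILE_CATEGORIES)

print(restaurant_map)  # restaurant indices keep their positions
print(automobile_map)  # 'Whole' (local index 5) moves to shared index 0
```

Restaurant indices are unchanged because `CATEGORIES` starts with the restaurant labels, while the automobile label `Whole` collapses onto the same shared index as the restaurant `Whole`.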

In [None]:
len(CATEGORIES)

11

In [None]:
train_automobiles_bert.head()

| | text_id | text | category | sentiment |
|---|---|---|---|---|
| 0 | 1276288 | `[CLS] Используем данный автомобиль в целях все...` | 5 | 1 |
| 1 | 1276288 | `[CLS] Автозапчасти недорогие , обслуживание до...` | 6 | 1 |
| 2 | 1276288 | `[CLS] Автозапчасти недорогие , обслуживание до...` | 6 | 1 |
| 3 | 1276288 | `[CLS] Автозапчасти недорогие , обслуживание до...` | 6 | 1 |
| 4 | 1276288 | `[CLS] Автозапчасти недорогие , обслуживание до...` | 6 | 1 |


#### BERT

In [None]:
train_dataset = pd.concat(
    [reindex_category(train_restaurants_bert, RESTAURANT_CATEGORIES),
     reindex_category(train_automobiles_bert, AUTOMOBILE_CATEGORIES)]
)
eval_dataset = pd.concat(
    [reindex_category(eval_restaurants_bert, RESTAURANT_CATEGORIES),
     reindex_category(eval_automobiles_bert, AUTOMOBILE_CATEGORIES)]
)
test_dataset = pd.concat(
    [reindex_category(test_restaurants_bert, RESTAURANT_CATEGORIES),
     reindex_category(test_automobiles_bert, AUTOMOBILE_CATEGORIES)]
)

In [None]:
train_dataset.to_csv('/content/drive/MyDrive/Summarization/sentiment analysis/both/train_bert_dataset.csv', sep='\t', index=False)
eval_dataset.to_csv('/content/drive/MyDrive/Summarization/sentiment analysis/both/eval_bert_dataset.csv', sep='\t', index=False)
test_dataset.to_csv('/content/drive/MyDrive/Summarization/sentiment analysis/both/test_bert_dataset.csv', sep='\t', index=False)

In [None]:
train_dataset.head()

| | text_id | text | category | sentiment |
|---|---|---|---|---|
| 0 | 10231 | `[CLS] Я несколько раз была в этом заведении , ...` | 0 | 1 |
| 1 | 10231 | `[CLS] Я несколько раз была в этом заведении , ...` | 3 | 1 |
| 2 | 10231 | `[CLS] Потрясающая паста с лососем , очень вкус...` | 3 | 1 |
| 3 | 10231 | `[CLS] Потрясающая паста с лососем , очень вкус...` | 3 | 1 |
| 4 | 10231 | `[CLS] Потрясающая паста с лососем , очень вкус...` | 3 | 1 |


#### XLM-RoBERTa

In [None]:
train_dataset = pd.concat(
    [reindex_category(train_restaurants_xlmroberta, RESTAURANT_CATEGORIES),
     reindex_category(train_automobiles_xlmroberta, AUTOMOBILE_CATEGORIES)]
)
eval_dataset = pd.concat(
    [reindex_category(eval_restaurants_xlmroberta, RESTAURANT_CATEGORIES),
     reindex_category(eval_automobiles_xlmroberta, AUTOMOBILE_CATEGORIES)]
)
test_dataset = pd.concat(
    [reindex_category(test_restaurants_xlmroberta, RESTAURANT_CATEGORIES),
     reindex_category(test_automobiles_xlmroberta, AUTOMOBILE_CATEGORIES)]
)

In [None]:
train_dataset.to_csv('/content/drive/MyDrive/Summarization/sentiment analysis/both/train_xlmroberta_dataset.csv', sep='\t', index=False)
eval_dataset.to_csv('/content/drive/MyDrive/Summarization/sentiment analysis/both/eval_xlmroberta_dataset.csv', sep='\t', index=False)
test_dataset.to_csv('/content/drive/MyDrive/Summarization/sentiment analysis/both/test_xlmroberta_dataset.csv', sep='\t', index=False)

In [None]:
train_dataset.head()

| | text_id | text | category | sentiment |
|---|---|---|---|---|
| 0 | 10231 | `<s> Я несколько раз была в этом заведении , о ...` | 0 | 1 |
| 1 | 10231 | `<s> Я несколько раз была в этом заведении , о ...` | 3 | 1 |
| 2 | 10231 | `<s> Потрясающая паста с лососем , очень вкусны...` | 3 | 1 |
| 3 | 10231 | `<s> Потрясающая паста с лососем , очень вкусны...` | 3 | 1 |
| 4 | 10231 | `<s> Потрясающая паста с лососем , очень вкусны...` | 3 | 1 |


In [None]:
len(train_dataset)

14106