# Preparing data for the token classification task
This notebook prepares data for a token classification task. We have two datasets of reviews, on restaurants and on automobiles, that have to be processed. To train a token classification model, we need the review texts split into sentences.

The datasets come with their own markup (aspect mentions given as character offsets), which requires a matching tokenization and labeling.
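
To make the idea of offset-based markup concrete, here is a minimal sketch of how character spans can be turned into BIO tags. It is not the processing used later in this notebook: the regex tokenizer, the function name `bio_tags`, and the example sentence are assumptions made for illustration only.

```python
import re

def bio_tags(text, spans):
    """Tag regex tokens with B-ASPECT / I-ASPECT / O given (start, end) character spans."""
    tokens, tags = [], []
    # words or single non-space, non-word characters, with their offsets
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end in spans:
            if match.start() == start:
                tag = "B-ASPECT"  # token opens the span
                break
            if start < match.start() < end:
                tag = "I-ASPECT"  # token begins inside the span
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags

# hypothetical sentence with two aspect spans given as character offsets
example_text = "Ремонтировали редко, салон удобный."
example_spans = [(0, 19), (21, 26)]
example_tokens, example_tags = bio_tags(example_text, example_spans)
print(list(zip(example_tokens, example_tags)))
```

The real pipeline below relies on `stanza` tokens instead of a regex, which is why the token offsets need extra care.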



## Imports

In [None]:
!pip3 install stanza --quiet



In [None]:
import logging
from string import punctuation
from google.colab import drive
import stanza
import numpy as np
import pandas as pd

stanza.download('ru')

logging.basicConfig(level=logging.INFO)

INFO:stanza:Downloading default packages for language: ru (Russian) ...
INFO:stanza:Finished downloading models and saved to /root/stanza_resources.


In [None]:
PUNCTUATION = punctuation.replace('\'', '').replace('"', '')

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


## Getting data

In [None]:
train_restaurants_reviews = pd.read_csv('/content/drive/MyDrive/Summarization/restaurant data/train_split_reviews.txt', delimiter='\t', names=['text_id', 'text'])
train_automobiles_reviews = pd.read_csv('/content/drive/MyDrive/Summarization/automobile data/train_split_reviews.txt', delimiter='\t', names=['text_id', 'text'])

train_reviews = pd.concat([train_restaurants_reviews, train_automobiles_reviews])

train_restaurants_reviews.head()

Unnamed: 0,text_id,text
0,10231,"Я несколько раз была в этом заведении,о кухне ..."
1,6376,Впервые посетила это замечательное место! С пе...
2,7824,"Праздновала в этом ресторане свой Др, праздник..."
3,11825,После прогулки решили с подругой зайти-перекус...
4,2107,Случайно зашли в это заведение с друзьями. Сде...


In [None]:
print(len(train_reviews))
print(len(set(train_reviews['text_id'].values)))

616
616


In [None]:
dev_restaurants_reviews = pd.read_csv('/content/drive/MyDrive/Summarization/restaurant data/dev_reviews.txt', delimiter='\t', names=['text_id', 'text'])
dev_automobiles_reviews = pd.read_csv('/content/drive/MyDrive/Summarization/automobile data/dev_reviews.txt', delimiter='\t', names=['text_id', 'text'])

dev_reviews = pd.concat([dev_restaurants_reviews, dev_automobiles_reviews])

len(dev_reviews)

206

In [None]:
train_restaurants_aspects = pd.read_csv('/content/drive/MyDrive/Summarization/restaurant data/train_split_aspects.txt', delimiter='\t', names=['text_id', 'category', 'mention', 'start', 'end', 'sentiment'])
train_automobiles_aspects = pd.read_csv('/content/drive/MyDrive/Summarization/automobile data/train_split_aspects.txt', delimiter='\t', names=['text_id', 'category', 'mention', 'start', 'end', 'sentiment'])

train_aspects = pd.concat([train_restaurants_aspects, train_automobiles_aspects])

train_restaurants_aspects.head()

Unnamed: 0,text_id,category,mention,start,end,sentiment
0,30808,Whole,ресторане,16,25,neutral
1,30808,Interior,первом этаже,43,55,neutral
2,30808,Whole,руководству ресторана,124,145,positive
3,30808,Service,обслуживающему персоналу,147,171,positive
4,30808,Service,сотрудникам,189,200,positive


In [None]:
dev_restaurants_aspects = pd.read_csv('/content/drive/MyDrive/Summarization/restaurant data/dev_aspects.txt', delimiter='\t', names=['text_id', 'category', 'mention', 'start', 'end', 'sentiment'])
dev_automobiles_aspects = pd.read_csv('/content/drive/MyDrive/Summarization/automobile data/dev_aspects.txt', delimiter='\t', names=['text_id', 'category', 'mention', 'start', 'end', 'sentiment'])

dev_aspects = pd.concat([dev_restaurants_aspects, dev_automobiles_aspects])

len(dev_aspects)

4794

In [None]:
# split dev into eval (first 50 reviews per domain) and test (the rest);
# these frames are used to divide the processed data later
eval_restaurants_reviews = dev_restaurants_reviews[:50]
test_restaurants_reviews = dev_restaurants_reviews[50:]
eval_automobiles_reviews = dev_automobiles_reviews[:50]
test_automobiles_reviews = dev_automobiles_reviews[50:]

eval_reviews = pd.concat([eval_restaurants_reviews, eval_automobiles_reviews])
test_reviews = pd.concat([test_restaurants_reviews, test_automobiles_reviews])

## Processing data

In [None]:
nlp = stanza.Pipeline('ru', processors='tokenize', verbose=False)

In [None]:
doc = nlp('Это первое предложение. А это второе предложение')

for sent in doc.sentences:
    print(sent.tokens[-1].text, sent.tokens[-1].end_char, sep='\t')

.	23
предложение	48


In [None]:
BIO = ['B-ASPECT', 'I-ASPECT', 'O']
BIO_sent = ['B-POS', 'I-POS', 'B-NEG', 'I-NEG', 'B-NEUT', 'I-NEUT', 'O']

label2id = {label: i for i, label in enumerate(BIO)}
id2label = {i: label for i, label in enumerate(BIO)}
label2id_sent = {label: i for i, label in enumerate(BIO_sent)}
id2label_sent = {i: label for i, label in enumerate(BIO_sent)}

After `stanza` tokenization, many tokens still have punctuation marks attached, which often makes it harder to identify aspects in the tokenized text. We therefore strip the punctuation and shift the character offsets accordingly.

In [None]:
text = ',,,начинка"'
start_char = 13
end_char = 23

right_text = text.lstrip(PUNCTUATION)
start_char = start_char + (len(text) - len(right_text))

token_text = right_text.rstrip(PUNCTUATION)
end_char = end_char - (len(right_text) - len(token_text))

In [None]:
print(token_text)
print(start_char, end_char)

начинка"
16 23
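
The same steps can be wrapped into a reusable helper. This is only a sketch: the name `strip_punct_span` is hypothetical, and `PUNCT` is rebuilt here so the block runs on its own.

```python
from string import punctuation

# same character set as PUNCTUATION above: quotes are kept on purpose
PUNCT = punctuation.replace("'", "").replace('"', "")

def strip_punct_span(token_text, start_char, end_char, chars=PUNCT):
    """Strip leading/trailing punctuation and shift the offsets to match."""
    left_stripped = token_text.lstrip(chars)
    start_char += len(token_text) - len(left_stripped)

    core = left_stripped.rstrip(chars)
    end_char -= len(left_stripped) - len(core)
    return core, start_char, end_char

print(strip_punct_span(',,,начинка"', 13, 23))
print(strip_punct_span('вкусно!!!', 0, 9))
```

Note that a token made only of punctuation collapses to an empty string with `start_char == end_char`, so a caller may prefer to skip such tokens.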


There is another difficulty: some aspects in the dataset overlap. This problem can be solved in two ways:


*   Keep only the aspects with the maximum spans
*   Generate several variants of the aspect markup, with smaller and larger text spans


*Note: the second option also yields additional training data (a form of data augmentation).*
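
The first option can be sketched as a filter over `(start, end)` pairs. This is an illustration under assumptions, not the logic of `get_data` below: `keep_maximal_spans` is a hypothetical helper that drops spans nested inside another span and leaves partially overlapping spans untouched.

```python
def keep_maximal_spans(spans):
    """Drop every (start, end) span that lies fully inside another span."""
    # sort by start ascending, then by end descending, so that a containing
    # span always comes before the spans nested in it
    ordered = sorted(set(spans), key=lambda span: (span[0], -span[1]))
    kept = []
    max_end = -1
    for start, end in ordered:
        if end > max_end:  # not covered by any earlier span
            kept.append((start, end))
            max_end = end
    return kept

# offsets in the style of the aspect files: (186, 199) is nested in (172, 199)
example = [(43, 53), (172, 199), (186, 199), (220, 227)]
print(keep_maximal_spans(example))
```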


In [None]:
def get_data(reviews: pd.DataFrame, aspects: pd.DataFrame) -> tuple:
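    '''
    Split reviews into sentences and build token-level labels to
    fine-tune models for aspect extraction (AE) framed as an NER task.

    Returns review ids, sentence texts, token lists, aspect label ids
    and sentiment label ids, one entry per sentence.
    '''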
    bad_labels = 0

    review_ids = []
    sentences = []
    data = []
    aspect_labels = []
    sentiment_labels = []

    logging.warning('Start getting data...')

    for rev_idx, rev in reviews.iterrows():
        # get text id and text
        text_id = int(rev['text_id'])
        text = rev['text']

        logging.warning('Text ID: %s' % text_id)

        # stanza processing to parse sentences
        doc = nlp(text)
        logging.warning('Processed by stanza...')

        # get needed mentions and aspects
        rev_aspects = aspects[aspects['text_id'] == text_id]

        mentions = rev_aspects['mention'].values.tolist()
        starts = rev_aspects['start'].values.tolist()
        ends = rev_aspects['end'].values.tolist()
        sentiment = rev_aspects['sentiment'].values.tolist()

        assert len(starts) == len(ends)

        # parse sentences
        logging.warning('Parse sentences...')
        for sent in doc.sentences:

            review_ids.append(text_id)  # text id that corresponds to sentence
            sentences.append(sent.text)

            sentence = []
            sentence_aspect_labels = []
            sentence_sentiment_labels = []

            # current state for multiple token aspects
            current_start = None
            current_end = 0
            current_id = None

            for token_idx, token in enumerate(sent.tokens):
                sentence.append(token.text)

                # save state of the token and indexes
                token_text = token.text
                start_char = token.start_char
                end_char = token.end_char

                # print(token_text, start_char, end_char)

                # remove punctuation from token string
                # (tokens made up only of punctuation are left untouched;
                # `not in PUNCTUATION` was a substring test and missed e.g. '...')
                if token_text.strip(PUNCTUATION):
                    right_text = token_text.lstrip(PUNCTUATION)
                    start_char = start_char + (len(token_text) - len(right_text))

                    token_text = right_text.rstrip(PUNCTUATION)
                    end_char = end_char - (len(right_text) - len(token_text))

                # if after that string does not match start or start+end
                # it is outside label
                # print(token_text, start_char, end_char)

                # if we are inside a multi-token aspect, continue it;
                # a nested aspect start must not open a new aspect
                # (compare with None: an aspect may start at offset 0)
                if current_start is not None:
                    # print('CURRENT START')
                    sentiment_value = sentiment[current_id]
                    sentence_aspect_labels.append(label2id.get('I-ASPECT', None))
                    if sentiment_value == 'positive':
                        sentence_sentiment_labels.append(label2id_sent.get('I-POS', None))
                    elif sentiment_value == 'negative':
                        sentence_sentiment_labels.append(label2id_sent.get('I-NEG', None))
                    elif sentiment_value == 'neutral' or sentiment_value == 'both':
                        sentence_sentiment_labels.append(label2id_sent.get('I-NEUT', None))

                    if end_char >= current_end:
                        # print('END OF INSIDE')
                        # print(token_idx, token.text, 'INSIDE')
                        # if it is the last token in the current aspect
                        # update state
                        current_start = None
                        current_end = 0
                        current_id = None

                elif start_char in starts:
                    # print('START')
                    
                    # starts may not be unique
                    dupl_start_idxs = [i for i, x in enumerate(starts) if x == start_char]
                    # among the aspects sharing this start, take the largest span
                    current_id = max(dupl_start_idxs, key=lambda i: ends[i])

                    if end_char < ends[current_id]:
                        # print('doesnt match', ends[current_id])
                        # print(token_idx, token.text, 'BEGIN')

                        current_start = starts[current_id]
                        current_end = ends[current_id]

                    sentence_aspect_labels.append(label2id.get('B-ASPECT', None))
                    sentiment_value = sentiment[current_id]
                    if sentiment_value == 'positive':
                        sentence_sentiment_labels.append(label2id_sent.get('B-POS', None))
                    elif sentiment_value == 'negative':
                        sentence_sentiment_labels.append(label2id_sent.get('B-NEG', None))
                    elif sentiment_value == 'neutral' or sentiment_value == 'both':
                        sentence_sentiment_labels.append(label2id_sent.get('B-NEUT', None))

                else:  # other cases
                    # print('OTHER')
                    sentence_aspect_labels.append(label2id.get('O', None))
                    sentence_sentiment_labels.append(label2id_sent.get('O', None))
                    # print(token_idx, token.text, 'OUTSIDE')

            if len(sentence_aspect_labels) != len(sentence) or\
            len(sentence_sentiment_labels) != len(sentence):
                print('MISMATCHED LABELING')
                print(sentence_aspect_labels)
                print('length of sentence aspect labels', len(sentence_aspect_labels))
                print('length of sentence sentiment labels', len(sentence_sentiment_labels))
                print(sentence)
                print('length of sentence', len(sentence))
                print(sent.text)

                bad_labels += 1

            data.append(sentence)
            aspect_labels.append(sentence_aspect_labels)
            sentiment_labels.append(sentence_sentiment_labels)

    logging.warning('Bad labels %d' % bad_labels)

    return review_ids, sentences, data, aspect_labels, sentiment_labels

In [None]:
train_aspects[train_aspects['text_id'] == 922226]  # overlapping aspects

Unnamed: 0,text_id,category,mention,start,end,sentiment
1250,922226,Whole,автомобиле,43,53,neutral
1251,922226,Comfort,салон,74,79,negative
1252,922226,Comfort,тесный,100,106,negative
1253,922226,Comfort,неудобный,109,118,negative
1254,922226,Whole,это не машина,143,156,negative
1255,922226,Comfort,Багажник,158,166,negative
1256,922226,Comfort,не отличается просторностью,172,199,negative
1257,922226,Comfort,просторностью,186,199,positive
1258,922226,Comfort,сверчки,220,227,negative
1259,922226,Comfort,салону,237,243,negative


In [None]:
text = train_reviews[train_reviews['text_id'] == 1234566]['text'].values[0]

text

'Наша белочка с нами уже 6 лет, менять пока не собираемся. Когда встал вопрос о том, какую машину можно купить в пределах 6000$, чтобы недорогая в обслуживании и экономичная в бензине, то выбар пал на наш белый Sens. Выбрали именно белый цвет-смотрится очень красиво. Ездили на нем постоянно, ремонтировали редко-настоящаяя лошадка. Из плюсов -в салоне достаточно хороший пластик -удобные сидения -прошла хороший краш тест (показала себя молодцом) -недорогая в обслуживании Из минусов: -движок 1.3 слабоват конечно( пробовала недавно ездить на Ланосе-сразу почувствовала разницу) В общем машинкой очень довольны, но время идет и хочется что-нибудь поновее, хотя и продавать жалко. '

In [None]:
train_aspects[train_aspects['text_id'] == 1234566]

Unnamed: 0,text_id,category,mention,start,end,sentiment
3499,1234566,Whole,машину,90,96,positive
3500,1234566,Costs,недорогая,134,143,positive
3501,1234566,Costs,обслуживании,146,158,positive
3502,1234566,Costs,экономичная,161,172,positive
3503,1234566,Costs,бензине,175,182,positive
3504,1234566,Whole,Sens,210,214,positive
3505,1234566,Appearance,смотрится,242,251,positive
3506,1234566,Appearance,красиво,258,265,positive
3507,1234566,Reliability,ремонтировали редко,292,311,positive
3508,1234566,Comfort,салоне,345,351,positive


In [None]:
doc = nlp(text)

for sent in doc.sentences:
    for token in sent.tokens:
        print(token.text, token.start_char, token.end_char)

Наша 0 4
белочка 5 12
с 13 14
нами 15 19
уже 20 23
6 24 25
лет 26 29
, 29 30
менять 31 37
пока 38 42
не 43 45
собираемся 46 56
. 56 57
Когда 58 63
встал 64 69
вопрос 70 76
о 77 78
том 79 82
, 82 83
какую 84 89
машину 90 96
можно 97 102
купить 103 109
в 110 111
пределах 112 120
6000 121 125
$ 125 126
, 126 127
чтобы 128 133
недорогая 134 143
в 144 145
обслуживании 146 158
и 159 160
экономичная 161 172
в 173 174
бензине 175 182
, 182 183
то 184 186
выбар 187 192
пал 193 196
на 197 199
наш 200 203
белый 204 209
Sens 210 214
. 214 215
Выбрали 216 223
именно 224 230
белый 231 236
цвет 237 241
- 241 242
смотрится 242 251
очень 252 257
красиво 258 265
. 265 266
Ездили 267 273
на 274 276
нем 277 280
постоянно 281 290
, 290 291
ремонтировали 292 305
редко-настоящаяя 306 322
лошадка 323 330
. 330 331
Из 332 334
плюсов 335 341
- 342 343
в 343 344
салоне 345 351
достаточно 352 362
хороший 363 370
пластик 371 378
- 379 380
удобные 380 387
сидения 388 395
- 396 397
прошла 397 403
хороший 404 411
кра

In [None]:
review_ids, sentences, data, aspect_labels, sentiment_labels = get_data(train_reviews[train_reviews['text_id'] == 1234566], train_aspects)



In [None]:
review_ids

[1234566, 1234566, 1234566, 1234566, 1234566, 1234566]

In [None]:
print(data[4])
print(aspect_labels[4])
print(sentiment_labels[4])

['Из', 'плюсов', '-', 'в', 'салоне', 'достаточно', 'хороший', 'пластик', '-', 'удобные', 'сидения', '-', 'прошла', 'хороший', 'краш', 'тест', '(', 'показала', 'себя', 'молодцом', ')', '-', 'недорогая', 'в', 'обслуживании', 'Из', 'минусов', ':', '-', 'движок', '1.3', 'слабоват', 'конечно', '(', 'пробовала', 'недавно', 'ездить', 'на', 'Ланосе', '-', 'сразу', 'почувствовала', 'разницу', ')']
[2, 2, 2, 2, 0, 2, 2, 0, 2, 0, 0, 2, 2, 2, 0, 1, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 2, 2, 2, 0, 1, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2]
[6, 6, 6, 6, 0, 6, 6, 0, 6, 0, 0, 6, 6, 6, 0, 1, 6, 6, 6, 6, 6, 6, 0, 6, 0, 6, 6, 6, 6, 2, 3, 6, 6, 6, 6, 6, 6, 6, 0, 6, 6, 6, 6, 6]
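
The label ids above are easier to check once mapped back to tags. A minimal sketch, assuming the same `BIO` order as defined earlier; `extract_aspects` is a hypothetical helper that is not part of the notebook.

```python
BIO = ['B-ASPECT', 'I-ASPECT', 'O']
id2label = dict(enumerate(BIO))

def extract_aspects(tokens, label_ids):
    """Join each B-ASPECT token with the I-ASPECT tokens that follow it."""
    aspects, current = [], []
    for token, label_id in zip(tokens, label_ids):
        label = id2label[label_id]
        if label == 'B-ASPECT':
            if current:  # close the previous aspect
                aspects.append(' '.join(current))
            current = [token]
        elif label == 'I-ASPECT' and current:
            current.append(token)
        else:
            if current:
                aspects.append(' '.join(current))
            current = []
    if current:
        aspects.append(' '.join(current))
    return aspects

# a slice of the sentence printed above
sample_tokens = ['прошла', 'хороший', 'краш', 'тест', '(']
sample_labels = [2, 2, 0, 1, 2]
print(extract_aspects(sample_tokens, sample_labels))
```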


Full data:

In [None]:
train_review_ids, train_sentences, train_data, train_aspect_labels, train_sentiment_labels = get_data(train_reviews, train_aspects)



In [None]:
len(train_data)

6771

In [None]:
dev_review_ids, dev_sentences, dev_data, dev_aspect_labels, dev_sentiment_labels = get_data(dev_reviews, dev_aspects)



In [None]:
len(dev_data)

2237

## Saving data

### Full data

In [None]:
train_max_dataset = pd.DataFrame({
    'review_id': train_review_ids,
    'sentence_text': train_sentences,
    'sentence_tokens': train_data,
    'aspect_labels': train_aspect_labels,
    'sentiment_labels': train_sentiment_labels
})

train_max_dataset.to_csv('/content/drive/MyDrive/Summarization/aspects/train_max_ner.tsv', index=False, sep='\t')
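
Because `sentence_tokens` and the label columns hold Python lists, they are written to the TSV file as their string representation and come back as plain strings when the file is read. A minimal sketch of restoring them; `parse_list_cell` is a hypothetical helper, and the cell values are simulated with `str()` so the block runs without the saved file.

```python
import ast

def parse_list_cell(cell: str) -> list:
    """Turn a stringified list such as "[0, 2]" back into a Python list."""
    value = ast.literal_eval(cell)
    if not isinstance(value, list):
        raise ValueError(f'expected a list, got {type(value).__name__}')
    return value

# what list cells look like once written to text: str() of the list
tokens_cell = str(['Вкусно', '.'])
labels_cell = str([0, 2])

tokens = parse_list_cell(tokens_cell)
labels = parse_list_cell(labels_cell)
print(tokens, labels)
```

When loading with `pd.read_csv`, such a function can be supplied through the `converters` argument, for example `converters={'aspect_labels': parse_list_cell}`.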

In [None]:
dev_max_dataset = pd.DataFrame({
    'review_id': dev_review_ids,
    'sentence_text': dev_sentences,
    'sentence_tokens': dev_data,
    'aspect_labels': dev_aspect_labels,
    'sentiment_labels': dev_sentiment_labels
})

In [None]:
eval_max_dataset = dev_max_dataset[dev_max_dataset['review_id'].isin(eval_reviews['text_id'].values.tolist())]
eval_max_dataset.to_csv('/content/drive/MyDrive/Summarization/aspects/eval_max_ner.tsv', index=False, sep='\t')

eval_max_dataset.head()

Unnamed: 0,review_id,sentence_text,sentence_tokens,aspect_labels,sentiment_labels
0,36381,Красиво.,"[Красиво, .]","[2, 2]","[6, 6]"
1,36381,Вкусно.,"[Вкусно, .]","[0, 2]","[0, 6]"
2,36381,"Обслуживание ""на уровне"".","[Обслуживание, "", на, уровне, "", .]","[0, 2, 2, 2, 2, 2]","[0, 6, 6, 6, 6, 6]"
3,36381,"Вот на этот ""уровень"" оно и ориентированно, не...","[Вот, на, этот, "", уровень, "", оно, и, ориенти...","[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ..."
4,36381,"Ресторан создавал впечатление пустого, только ...","[Ресторан, создавал, впечатление, пустого, ,, ...","[0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]","[4, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]"


In [None]:
test_max_dataset = dev_max_dataset[dev_max_dataset['review_id'].isin(test_reviews['text_id'].values.tolist())]
test_max_dataset.to_csv('/content/drive/MyDrive/Summarization/aspects/test_max_ner.tsv', index=False, sep='\t')

test_max_dataset.head()

Unnamed: 0,review_id,sentence_text,sentence_tokens,aspect_labels,sentiment_labels
559,13100,"решили этот день рождения отметить семьей-я,же...","[решили, этот, день, рождения, отметить, семье...","[2, 2, 2, 2, 2, 2, 2, 2, 2, 2]","[6, 6, 6, 6, 6, 6, 6, 6, 6, 6]"
560,13100,"За пару дней заказал столик,что оказалось весь...","[За, пару, дней, заказал, столик, ,, что, оказ...","[2, 2, 2, 0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...","[6, 6, 6, 4, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ..."
561,13100,При входе улыбчивая девушка хостесс встретила ...,"[При, входе, улыбчивая, девушка, хостесс, встр...","[2, 2, 2, 0, 1, 0, 2, 0, 1, 1, 2, 2, 2]","[6, 6, 6, 0, 1, 0, 6, 0, 1, 1, 6, 6, 6]"
562,13100,Обслуживал молодой человек Антон.При выборе бл...,"[Обслуживал, молодой, человек, Антон., При, вы...","[0, 0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...","[4, 4, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ..."
563,13100,"На десерт жена взяла классический Наполеон,про...","[На, десерт, жена, взяла, классический, Наполе...","[2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...","[6, 0, 6, 6, 6, 0, 6, 6, 6, 6, 6, 6, 6, 6, 6, ..."


### Restaurants

In [None]:
train_restaurants_max_dataset = train_max_dataset[train_max_dataset['review_id'].isin(train_restaurants_reviews['text_id'].values.tolist())]

train_restaurants_max_dataset.to_csv('/content/drive/MyDrive/Summarization/aspects/train_restaurants_max_ner.tsv', index=False, sep='\t')

In [None]:
eval_restaurants_max_dataset = dev_max_dataset[dev_max_dataset['review_id'].isin(eval_restaurants_reviews['text_id'].values.tolist())]

eval_restaurants_max_dataset.to_csv('/content/drive/MyDrive/Summarization/aspects/eval_restaurants_max_ner.tsv', index=False, sep='\t')

In [None]:
test_restaurants_max_dataset = dev_max_dataset[dev_max_dataset['review_id'].isin(test_restaurants_reviews['text_id'].values.tolist())]

test_restaurants_max_dataset.to_csv('/content/drive/MyDrive/Summarization/aspects/test_restaurants_max_ner.tsv', index=False, sep='\t')

### Automobiles

In [None]:
train_automobiles_max_dataset = train_max_dataset[train_max_dataset['review_id'].isin(train_automobiles_reviews['text_id'].values.tolist())]

train_automobiles_max_dataset.to_csv('/content/drive/MyDrive/Summarization/aspects/train_automobiles_max_ner.tsv', index=False, sep='\t')

In [None]:
eval_automobiles_max_dataset = dev_max_dataset[dev_max_dataset['review_id'].isin(eval_automobiles_reviews['text_id'].values.tolist())]

eval_automobiles_max_dataset.to_csv('/content/drive/MyDrive/Summarization/aspects/eval_automobiles_max_ner.tsv', index=False, sep='\t')

In [None]:
test_automobiles_max_dataset = dev_max_dataset[dev_max_dataset['review_id'].isin(test_automobiles_reviews['text_id'].values.tolist())]

test_automobiles_max_dataset.to_csv('/content/drive/MyDrive/Summarization/aspects/test_automobiles_max_ner.tsv', index=False, sep='\t')