## Import libraries

In [1]:
import pandas as pd
import numpy as np



## Read data

In [3]:
train = pd.read_json('../data/train.jsonl', lines=True)
train

Unnamed: 0,ners,sentences,id
0,"[[0, 5, CITY], [16, 23, PERSON], [34, 41, PERS...",Бостон взорвали Тамерлан и Джохар Царнаевы из ...,0
1,"[[21, 28, PROFESSION], [53, 67, ORGANIZATION],...",Умер избитый до комы гитарист и сооснователь г...,1
2,"[[0, 4, PERSON], [37, 42, COUNTRY], [47, 76, O...",Путин подписал распоряжение о выходе России из...,2
3,"[[0, 11, PERSON], [36, 47, PROFESSION], [49, 6...",Бенедикт XVI носил кардиостимулятор\nПапа Римс...,3
4,"[[0, 4, PERSON], [17, 29, ORGANIZATION], [48, ...",Обама назначит в Верховный суд латиноамериканк...,4
...,...,...,...
514,"[[42, 46, COUNTRY], [82, 87, COUNTRY], [104, 1...",Глава Малайзии: мы не хотим противостоять Кита...,514
515,"[[1, 4, PRODUCT], [31, 33, FACILITY], [35, 44,...",«Союз» впервые пристыковался к МКС за 6 часов\...,515
516,"[[0, 4, PERSON], [8, 12, PERSON], [45, 52, AGE...",Трамп и Путин сделали совместное заявление к 7...,516
517,"[[0, 9, NATIONALITY], [58, 72, PERSON], [101, ...",Российский магнат устроил самую дорогую свадьб...,517
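
Each entry in `ners` is a `[start, end, label]` triple of character offsets into `sentences`. Judging by the first row, where `[0, 5, CITY]` covers the six-letter word «Бостон», the end offset appears to be inclusive. A minimal sketch of reading such a span, assuming that convention holds for the whole dataset (the helper name `span_text` is hypothetical):

```python
def span_text(text, start, end):
    """Return the substring for a span whose end offset is inclusive."""
    # Python slices exclude the stop index, so add 1 to the inclusive end.
    return text[start:end + 1]


# First words of the first training row shown above.
sample = "Бостон взорвали Тамерлан"
print(span_text(sample, 0, 5))    # Бостон
print(span_text(sample, 16, 23))  # Тамерлан
```

This is why the preprocessing code has to shift the end offset by one before handing it to spaCy, which works with exclusive ends.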


## Second solution: train a spaCy model
Train a spaCy model on the given data so that it can recognize the different types of named entities.

Tutorial I used: https://www.newscatcherapi.com/blog/train-custom-named-entity-recognition-ner-model-with-spacy-v3

In [4]:
# Install the library

# !pip install spacy



In [4]:
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

nlp = spacy.blank("ru")  # create a blank Russian pipeline
doc_bin = DocBin()  # container for the serialized training docs

### Preprocess data
This part of the code is adapted from the tutorial, with some modifications.

In [6]:
from spacy.util import filter_spans

skip = 0
for _, row in tqdm(train.iterrows()):
    text = row['sentences']
    labels = row['ners']
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in labels:
        start, end = int(start), int(end)
        # End offsets in the data are inclusive; char_span expects an exclusive end
        span = doc.char_span(start, end+1, label=label, alignment_mode="contract")
        # The spaCy model is sensitive to whitespace,
        # so skip the span if its first or last character is a space
        if span is None or span.text.strip() != span.text:
            skip += 1
        else:
            ents.append(span)
    filtered_ents = filter_spans(ents)
    doc.ents = filtered_ents
    doc_bin.add(doc)
print("Skipped:", skip)

doc_bin.to_disk("training_data.spacy") # save the docbin object

519it [00:04, 123.80it/s]


Skipped: 175
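
Skipping spans loses training signal. One possible alternative, which is not what this notebook does, is to trim whitespace from the edges of a span before building it. The sketch below assumes inclusive end offsets, as in the data, and uses a hypothetical helper `trim_span` with a made-up sample string. It only handles edge whitespace; a span that cuts through a token can still come back as `None` from `char_span`.

```python
def trim_span(text, start, end):
    """Shrink an inclusive [start, end] span so it has no edge whitespace.

    Returns the adjusted (start, end), or None if the span is all whitespace.
    """
    # Move the start forward past leading whitespace.
    while start <= end and text[start].isspace():
        start += 1
    # Move the end backward past trailing whitespace.
    while end >= start and text[end].isspace():
        end -= 1
    return (start, end) if start <= end else None


sample = "Папа Римский \nБенедикт"
print(trim_span(sample, 4, 12))   # (5, 11)
print(trim_span(sample, 12, 13))  # None
```

The trimmed offsets could then be passed to `doc.char_span(start, end + 1, ...)` in place of the original ones.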


A transformer pipeline is not available for Russian, so I use a tok2vec model.

In [8]:
# Command to create config file for the model
! python -m spacy init fill-config base_config.cfg config.cfg

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [9]:
# Command to train the model with training data
! python -m spacy train config.cfg --output ./ --paths.train ./training_data.spacy --paths.dev ./training_data.spacy --gpu-id=0

ℹ Saving to output directory: .
ℹ Using GPU: 0
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00    153.76    1.24    0.73    4.05    0.01
  0     200       2343.65  16632.25   30.23   44.79   22.81    0.30
  0     400       2171.15  11349.53   42.56   50.07   37.00    0.43
  1     600       4121.03  10613.18   57.00   65.08   50.70    0.57
  1     800       4314.17   8790.57   61.74   62.95   60.57    0.62
  1    1000       2531.60   9134.75   67.37   72.46   62.94    0.67
  2    1200       7066.52   7672.29   69.41   74.34   65.09    0.69
  2    1400       1948.08   7755.01   69.71   74.16   65.77    0.70
  3    1600       6707.81   7128.19   74.57   78.53   70.99    0.75
  3    1800       3337.74   6156.37   76.90   78
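
The `ENTS_P`, `ENTS_R`, and `ENTS_F` columns are entity-level precision, recall, and F-score on the dev set, reported on a 0 to 100 scale. Because the training command passes the training file as the dev set too, these numbers measure fit to the training data rather than generalization. The sketch below illustrates how exact-match span scores of this kind can be computed. It is not spaCy's own scorer, and the function name `span_prf` and the sample spans are made up for illustration.

```python
def span_prf(gold, pred):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold, pred = set(gold), set(pred)
    # A prediction counts only if start, end and label all match a gold span.
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [(0, 5, "CITY"), (16, 23, "PERSON"), (27, 32, "PERSON")]
pred = [(0, 5, "CITY"), (16, 23, "PERSON"), (27, 31, "PERSON"), (40, 45, "COUNTRY")]

p, r, f = span_prf(gold, pred)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")  # P=0.50 R=0.67 F=0.57
```

Note that the third prediction misses by a single character and so earns no credit, which is why boundary handling in preprocessing matters.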

In [5]:
# Same function as in the 1st solution (without the dictionary)
def predict(nlp, df, sentence_row):
    '''
    Use a trained spaCy model to predict
    named entities in a text

    Arguments:
    nlp - the spaCy model
    df - dataframe with data
    sentence_row - name of the column with text
    '''
    ents = []
    for _, row in df.iterrows():
        # Get the entities for each row
        doc = nlp(row[sentence_row])
        ents.append([[e.start_char, e.end_char-1, e.label_] for e in doc.ents])
    return ents

### Predict

In [8]:
test = pd.read_json('../data/test.jsonl', lines=True)
test

Unnamed: 0,senences,id
0,Владелец «Бирмингема» получил шесть лет тюрьмы...,584
1,Акция протеста на Майдане Независимости объявл...,585
2,Фольксваген может перейти под контроль Порше \...,586
3,В Москве покажут фильмы Чарли Чаплина с живой ...,587
4,Чулпан Хаматова сыграет главную роль в фильме ...,588
...,...,...
60,ОБСЕ назвала референдум о статусе Крыма незако...,644
61,Египетского студента могут выслать из страны з...,645
62,Геннадий Онищенко отправлен в отставку\nГеннад...,646
63,Племянник Алишера Усманова разбился в ДТП\nВид...,647


In [9]:
nlp_ner = spacy.load("model-best")
answer = predict(nlp_ner, test, 'senences')

In [10]:
# Prepare the dataframe for output
output_df = test.drop(columns=['senences'])
output_df['ners'] = answer

output_df

Unnamed: 0,id,ners
0,584,"[[0, 19, PROFESSION], [30, 38, NUMBER], [64, 6..."
1,585,"[[0, 13, EVENT], [18, 38, FACILITY], [40, 59, ..."
2,586,"[[0, 10, PERSON], [39, 43, PERSON], [52, 61, P..."
3,587,"[[2, 7, CITY], [24, 36, PERSON], [72, 91, WORK..."
4,588,"[[0, 14, PERSON], [50, 61, PERSON], [63, 78, P..."
...,...,...
60,644,"[[0, 3, ORGANIZATION], [34, 38, STATE_OR_PROVI..."
61,645,"[[70, 82, PERSON], [118, 129, COUNTRY], [145, ..."
62,646,"[[0, 16, PERSON], [30, 37, EVENT], [39, 55, PE..."
63,647,"[[10, 25, PERSON], [27, 34, EVENT], [38, 40, E..."


In [11]:
# Write the output
with open('test.jsonl', 'w', encoding='utf-8') as f:
    f.write(output_df.to_json(orient='records', lines=True, force_ascii=False))
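
Before uploading, it can help to sanity-check the file format. The sketch below serializes a couple of records taken from the output above as JSON Lines and parses them back, using only the standard library. The helper names `to_jsonl` and `check_jsonl` are hypothetical, and the checks reflect my reading of the output format (one object per line with `id` and `ners`), not an official specification of what CodaLab expects.

```python
import json

# Two records copied from the first rows of the output shown above (truncated).
records = [
    {"id": 584, "ners": [[0, 19, "PROFESSION"], [30, 38, "NUMBER"]]},
    {"id": 585, "ners": [[0, 13, "EVENT"], [18, 38, "FACILITY"]]},
]


def to_jsonl(rows):
    """Serialize rows as JSON Lines, keeping non-ASCII characters readable."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


def check_jsonl(payload):
    """Parse JSON Lines and verify each row's basic shape; return the row count."""
    rows = [json.loads(line) for line in payload.splitlines() if line.strip()]
    for row in rows:
        assert set(row) == {"id", "ners"}
        for start, end, label in row["ners"]:
            assert isinstance(start, int) and isinstance(end, int)
            assert 0 <= start <= end
            assert isinstance(label, str)
    return len(rows)


print(check_jsonl(to_jsonl(records)))  # 2
```

The same `check_jsonl` could be applied to the contents of the written `test.jsonl` file.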

## Conclusion

This solution scored **0.52** points on CodaLab. 

Some ideas for improvement: 
- use pre-trained word vectors (e.g. from ru_core_news_lg)
- split the training dataset into train and eval sets (the tutorial trains and evaluates on the same data, but I think a held-out eval set can improve the model's performance)
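
The second idea can be sketched as a reproducible index split. This is a minimal illustration using only the standard library; the function name `split_indices`, the 20% dev fraction, and the seed are assumptions, not values from the notebook.

```python
import random


def split_indices(n_rows, dev_fraction=0.2, seed=0):
    """Shuffle row positions and split them into train and dev lists."""
    indices = list(range(n_rows))
    # A seeded generator keeps the split identical across runs.
    random.Random(seed).shuffle(indices)
    n_dev = max(1, int(n_rows * dev_fraction))
    return sorted(indices[n_dev:]), sorted(indices[:n_dev])


# 519 is the number of training rows reported by tqdm above.
train_idx, dev_idx = split_indices(519)
print(len(train_idx), len(dev_idx))  # 416 103
```

The two lists could then select rows with `train.iloc[train_idx]` and `train.iloc[dev_idx]`, each converted to its own `DocBin` file and passed to `--paths.train` and `--paths.dev` respectively.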