## Import libraries

In [1]:
import pandas as pd
import numpy as np



## Read data
The data is saved in the "assignment3/data" folder.

In [2]:
train = pd.read_json('../data/train.jsonl', lines=True)
train

Unnamed: 0,ners,sentences,id
0,"[[0, 5, CITY], [16, 23, PERSON], [34, 41, PERS...",Бостон взорвали Тамерлан и Джохар Царнаевы из ...,0
1,"[[21, 28, PROFESSION], [53, 67, ORGANIZATION],...",Умер избитый до комы гитарист и сооснователь г...,1
2,"[[0, 4, PERSON], [37, 42, COUNTRY], [47, 76, O...",Путин подписал распоряжение о выходе России из...,2
3,"[[0, 11, PERSON], [36, 47, PROFESSION], [49, 6...",Бенедикт XVI носил кардиостимулятор\nПапа Римс...,3
4,"[[0, 4, PERSON], [17, 29, ORGANIZATION], [48, ...",Обама назначит в Верховный суд латиноамериканк...,4
...,...,...,...
514,"[[42, 46, COUNTRY], [82, 87, COUNTRY], [104, 1...",Глава Малайзии: мы не хотим противостоять Кита...,514
515,"[[1, 4, PRODUCT], [31, 33, FACILITY], [35, 44,...",«Союз» впервые пристыковался к МКС за 6 часов\...,515
516,"[[0, 4, PERSON], [8, 12, PERSON], [45, 52, AGE...",Трамп и Путин сделали совместное заявление к 7...,516
517,"[[0, 9, NATIONALITY], [58, 72, PERSON], [101, ...",Российский магнат устроил самую дорогую свадьб...,517


## Baseline solution: use pre-trained spaCy model
spaCy ships a pre-trained pipeline for named entity recognition in Russian. Its main limitation is that it recognizes only three entity types: person (PER), organization (ORG), and location (LOC). Let's apply the model as-is, without any adaptation, and see what share of the entities it covers.
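Before running the model, it can help to estimate an upper bound on that coverage from the gold labels alone. The sketch below is only an illustration: `label_coverage` and `SUPPORTED` are hypothetical names, not part of the notebook, and the toy `sample` is just the visible start of the first two training rows. It assumes the three model labels are renamed to the dataset's spelling (PERSON, LOCATION, ORGANIZATION).

```python
from collections import Counter

# Labels the pre-trained model could produce after renaming (assumption:
# PER -> PERSON, LOC -> LOCATION, ORG -> ORGANIZATION)
SUPPORTED = {'PERSON', 'LOCATION', 'ORGANIZATION'}

def label_coverage(ners_column, supported=SUPPORTED):
    """Return (share of gold entities with a supported label, label counts)."""
    counts = Counter(label for ners in ners_column for _, _, label in ners)
    total = sum(counts.values())
    covered = sum(n for label, n in counts.items() if label in supported)
    return (covered / total if total else 0.0), counts

# Toy sample in the same [start, end, label] format as the `ners` column
sample = [
    [[0, 5, 'CITY'], [16, 23, 'PERSON']],
    [[21, 28, 'PROFESSION'], [53, 67, 'ORGANIZATION']],
]
share, counts = label_coverage(sample)
print(f'{share:.2f}')  # prints 0.50
```

On the real data the same function could be called as `label_coverage(train['ners'])`, since it only iterates over the column.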

In [3]:
# Install the library and pre-trained model

# !pip install spacy
# !python -m spacy download ru_core_news_lg

In [4]:
# Dictionary to replace entity names as in the given dataset
label_dict = {'PER': 'PERSON', 'LOC': 'LOCATION', 'ORG': 'ORGANIZATION'}

# Simple function to predict named entities.
# The name of the text column is passed in by the caller
# (because of a typo in the column name of the test data).
def predict(nlp, df, sentence_row):
    '''
    Use a pre-trained spaCy model to predict named entities
    in a text (only PERSON, LOCATION and ORGANIZATION).

    Arguments:
    nlp - the spaCy model
    df - dataframe with the data
    sentence_row - name of the column with the text

    Returns one list of [start, end, label] entries per row;
    end is inclusive, as in the training data.
    '''
    ents = []
    for _, row in df.iterrows():
        # Get the entities for each row
        doc = nlp(row[sentence_row])
        # spaCy's end_char is exclusive, so subtract 1 to match the dataset
        ents.append([[e.start_char, e.end_char - 1, label_dict.get(e.label_)] for e in doc.ents])
    return ents
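
The `end_char-1` in `predict` reflects a difference in offset conventions: spaCy reports an end-exclusive `end_char`, while the training data appears to use an inclusive end (for example `[0, 5, CITY]` for the six-letter word "Бостон"). A minimal sketch of that conversion, using only the visible start of the first training sentence; `to_inclusive` and `span_text` are hypothetical helper names:

```python
def to_inclusive(start_char, end_char, label):
    """Convert an end-exclusive span (spaCy style) to the dataset's inclusive form."""
    return [start_char, end_char - 1, label]

def span_text(text, span):
    """Recover the surface string of an inclusive [start, end, label] span."""
    start, end_inclusive, _ = span
    # Python slices are end-exclusive, so add 1 back
    return text[start:end_inclusive + 1]

text = 'Бостон взорвали Тамерлан и Джохар Царнаевы'

# For 'Бостон' an end-exclusive span would be start_char=0, end_char=6
span = to_inclusive(0, 6, 'CITY')
print(span)                   # [0, 5, 'CITY']
print(span_text(text, span))  # Бостон

# The gold spans from the first training row line up with this convention
print(span_text(text, [16, 23, 'PERSON']))  # Тамерлан
```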

### Predict on the test data

In [5]:
test = pd.read_json('../data/test.jsonl', lines=True)
test

Unnamed: 0,senences,id
0,Владелец «Бирмингема» получил шесть лет тюрьмы...,584
1,Акция протеста на Майдане Независимости объявл...,585
2,Фольксваген может перейти под контроль Порше \...,586
3,В Москве покажут фильмы Чарли Чаплина с живой ...,587
4,Чулпан Хаматова сыграет главную роль в фильме ...,588
...,...,...
60,ОБСЕ назвала референдум о статусе Крыма незако...,644
61,Египетского студента могут выслать из страны з...,645
62,Геннадий Онищенко отправлен в отставку\nГеннад...,646
63,Племянник Алишера Усманова разбился в ДТП\nВид...,647


In [6]:
import spacy

nlp = spacy.load("ru_core_news_lg")
answer = predict(nlp, test, 'senences')

In [7]:
output_df = test.drop(columns=['senences'])
output_df['ners'] = answer

output_df

Unnamed: 0,id,ners
0,584,"[[10, 19, ORGANIZATION], [47, 69, ORGANIZATION..."
1,585,"[[18, 38, LOCATION], [90, 103, PERSON], [202, ..."
2,586,"[[39, 61, PERSON], [65, 75, LOCATION], [78, 85..."
3,587,"[[2, 7, LOCATION], [24, 36, PERSON], [96, 109,..."
4,588,"[[0, 14, PERSON], [50, 61, PERSON], [63, 78, P..."
...,...,...
60,644,"[[0, 3, ORGANIZATION], [34, 38, LOCATION], [84..."
61,645,"[[63, 68, PERSON], [70, 93, PERSON], [396, 417..."
62,646,"[[0, 16, PERSON], [39, 55, PERSON], [91, 106, ..."
63,647,"[[10, 25, PERSON], [49, 55, LOCATION], [121, 1..."


In [8]:
# Write the output
with open('test.jsonl', 'w', encoding='utf-8') as f:
    f.write(output_df.to_json(orient='records', lines=True, force_ascii=False))

## Conclusion
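The solution is straightforward but inflexible: it applies the model as-is and can only produce three entity types, while the training data also contains labels such as CITY, COUNTRY and PROFESSION that this baseline can never predict. It scored **0.06** points on CodaLab. In the next solution I will try to train the spaCy model on the competition data.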
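
To compare solutions without submitting each one, a rough local check on the training set could be used. The sketch below computes a micro-averaged F1 over exact `(start, end, label)` matches; this is an assumed stand-in and not necessarily the metric CodaLab uses, and `span_f1` is a hypothetical helper name.

```python
def span_f1(gold, pred):
    """Micro-averaged F1 over exact (start, end, label) matches.

    gold and pred are parallel sequences with one list of
    [start, end, label] entries per text.
    """
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        g_set = {tuple(s) for s in g}
        p_set = {tuple(s) for s in p}
        tp += len(g_set & p_set)
        n_gold += len(g_set)
        n_pred += len(p_set)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: the span boundaries agree, but CITY vs LOCATION is a label mismatch
gold = [[[0, 5, 'CITY'], [16, 23, 'PERSON']]]
pred = [[[0, 5, 'LOCATION'], [16, 23, 'PERSON']]]
score = span_f1(gold, pred)
print(score)  # 0.5
```

With the notebook's objects this could be called as `span_f1(train['ners'], predict(nlp, train, 'sentences'))`.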