<a href="https://colab.research.google.com/github/componavt/topkar-space/blob/main/src/ner/nltk-ru-ner.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NER with NLTK
Let's use the prepared file with text data from the TopKar project. We will try to extract named entities from the text using the nltk library.

NLTK's named entity recognizer assigns each entity a category such as ORGANIZATION, GPE (geo-political entity), or PERSON.

- Named Entity Recognition (NER)
    - NLP task to identify important named entities in the text
        - People, places, organizations
        - Dates, states, works of art
    - Can be used alongside topic identification
    - Who? What? When? Where?

In [42]:
from pprint import pprint
import matplotlib.pyplot as plt

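> Note: Before running NER with NLTK, download the `averaged_perceptron_tagger`, `maxent_ne_chunker`, and `words` data packages. Recent NLTK releases load the `averaged_perceptron_tagger_eng` and `maxent_ne_chunker_tab` variants instead; those are downloaded in a later cell.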

In [43]:
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [44]:
import pandas as pd

csv_files = [
#    "https://raw.githubusercontent.com/componavt/topkar-space/main/data/sample10.csv",
    "https://raw.githubusercontent.com/componavt/topkar-space/main/data/sample100.csv",
]

df = pd.concat([pd.read_csv(url, sep = ';') for url in csv_files], ignore_index=True)
df = df.reset_index()  # make sure indexes pair with number of rows
df.head()

|   | index | Name | Synonym | Text |
|---|---|---|---|---|
| 0 | 0 | Наволоцкая поляна | наволоцкая поляна | Пахотная поляна в устье Пижейручья. |
| 1 | 1 | Галайские пожни | галайские пожни | покосы на юге острова Галайский, см. Галайский... |
| 2 | 2 | Pahag’ärv | pahagärv | Маленькое озерко за хутором Poh'd'ad'g', между... |
| 3 | 3 | Sod’järvt’e | sodjärvte | Поляны по дороге в дер. Сидорово. |
| 4 | 4 | Piirdoinkodi | piirdoinkodi | Дом в дер. Койвусельга, фам. Михайлов. |


In [45]:
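def extract_entity_names(t):
  """Recursively collect the text of every chunk labeled 'NE'.

  The 'NE' label is produced only when the chunker runs with binary=True.
  """
  entity_names = []

  # Only Tree nodes have a label; (token, tag) leaves are skipped,
  # otherwise the recursion would descend into plain strings forever.
  if hasattr(t, 'label'):
    if t.label() == 'NE':
      entity_names.append(' '.join(child[0] for child in t))
    else:
      for child in t:
        entity_names.extend(extract_entity_names(child))

  return entity_names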
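
The helper above looks for the single label `NE`, while the loop in the next cell calls the chunker with `binary=False`, which yields typed labels such as PERSON or GPE. As a minimal sketch (not part of the original notebook), the hypothetical function `extract_labeled_entities` below collects `(label, text)` pairs from the top-level chunks of a tree. The sample tree is built by hand, assuming the flat shape that `nltk.ne_chunk` returns, so no model download is needed to try it.

```python
from nltk import Tree


def extract_labeled_entities(tree):
    """Return (label, text) pairs for every labeled chunk directly under the root."""
    entities = []
    for child in tree:
        # Chunks are Tree nodes; ordinary tokens are plain (token, tag) tuples
        if isinstance(child, Tree):
            text = ' '.join(token for token, _tag in child.leaves())
            entities.append((child.label(), text))
    return entities


# Hand-built example (hypothetical data) in the shape produced with binary=False
sample = Tree('S', [
    Tree('PERSON', [('Anna', 'NNP')]),
    ('lives', 'VBZ'),
    ('in', 'IN'),
    Tree('GPE', [('Petrozavodsk', 'NNP')]),
])

entities = extract_labeled_entities(sample)
print(entities)
```

Returning the label together with the text makes it easy to filter later, for example keeping only GPE chunks when looking for place names.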

In [47]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('stopwords')
nltk.download('punkt_tab')                       # tokenizer models (formerly 'punkt')
nltk.download('averaged_perceptron_tagger_eng')  # English POS tagger used by nltk.pos_tag
nltk.download('maxent_ne_chunker_tab')           # model used by nltk.ne_chunk_sents

# import string
from nltk.corpus import stopwords
stop_words = stopwords.words("russian")

!pip install -U pymorphy2==0.9.1
import pymorphy2
morph = pymorphy2.MorphAnalyzer(lang='ru')

# Display
# print("df (toponyms):")
# print(df.head().to_string())
# print(df.describe())

# Replace missing descriptions (NaN) with empty strings before tokenizing
df['Text'] = df['Text'].fillna("")

# Convert the description column to a list
lines = df['Text'].tolist()
print()
print("Number of toponyms:", len(lines))

for i, li in enumerate(lines):
    print("\n{0}, {1} = ".format(i, li))
    # Tokenize the text line into sentences
    sentences = sent_tokenize(li, language='russian')
    print("\nsentences = ", sentences)

    # Tokenize each sentence into words
    tokens = [nltk.word_tokenize(sent, language='russian', preserve_line=True)
      for sent in sentences]

    # Tag parts of speech. Note: pos_tag defaults to the English tagger,
    # so the tags assigned to Russian words are unreliable.
    pos_sentences = [nltk.pos_tag(tok) for tok in tokens]

    # Create the named entity chunks: chunked_sentences
    # binary=False yields typed labels (GPE, PERSON, ORGANIZATION, ...) instead of a single 'NE'
    chunked_sentences = nltk.ne_chunk_sents(pos_sentences, binary=False)

    for sent in chunked_sentences:
      #print("sent = ", sent)
      for chunk in sent:
        #print("chunk = ", chunk)
        if hasattr(chunk, "label"): # and chunk.label() == 'NE':
          print("{0} ({1})".format(chunk, chunk.label()))

    # Alternative kept for reference: per-sentence processing with lemmatization.
    # for sent in sentences:
    #     tokens = nltk.word_tokenize(sent, language='russian', preserve_line=True)
    #
    #     # Tag the tokenized sentence with parts of speech
    #     pos_tokens = nltk.pos_tag(tokens)  # optionally tagset='universal'
    #     print("pos_tokens = ", pos_tokens)
    #
    #     # binary=True marks every entity with the single label 'NE';
    #     # ne_chunk handles one tagged sentence, ne_chunk_sents a list of them
    #     chunked_tokens = nltk.ne_chunk(pos_tokens, binary=True)
    #     print("chunked_tokens = ", chunked_tokens)
    #
    #     for word in tokens:
    #         # skip numbers, punctuation, and stop words
    #         if word.isalpha() and word not in stop_words:
    #             lemma = morph.parse(word)[0].normal_form
    #             # print("lemma = ", lemma)
    #
    #     for ch in chunked_tokens:
    #         if hasattr(ch, "label") and ch.label() == 'NE':
    #             print(ch)
    #
    # if i > 2:
    #     break

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker_tab is already up-to-date!



Number of toponyms: 100

0, Пахотная поляна в устье Пижейручья. = 

sentences =  ['Пахотная поляна в устье Пижейручья.']
(PERSON Пахотная/JJ) (PERSON)

1, покосы на юге острова Галайский, см. Галайский Остров = 

sentences =  ['покосы на юге острова Галайский, см. Галайский Остров']

2, Маленькое озерко за хутором Poh'd'ad'g', между оз. Sar'gär'v и Vougedg'är'v. = 

sentences =  ["Маленькое озерко за хутором Poh'd'ad'g', между оз. Sar'gär'v и Vougedg'är'v."]
(PERSON Маленькое/JJ) (PERSON)

3, Поляны по дороге в дер. Сидорово. = 

sentences =  ['Поляны по дороге в дер. Сидорово.']
(PERSON Поляны/JJ) (PERSON)

4, Дом в дер. Койвусельга, фам. Михайлов. = 

sentences =  ['Дом в дер. Койвусельга, фам. Михайлов.']

5, Река Колпь, вытекает из оз. Jokšar'v, течет в Вологодск. обл. = 

sentences =  ["Река Колпь, вытекает из оз. Jokšar'v, течет в Вологодск.", 'обл.']
(GPE Река/NN) (GPE)
(ORGANIZATION Колпь/NN) (ORGANIZATION)
(PERSON Jokšar/NNP) (PERSON)
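
In item 5 above, `sent_tokenize` splits the description after the abbreviation "Вологодск.", leaving 'обл.' as a sentence of its own. One possible workaround, offered as a heuristic sketch and not as part of the original notebook, is to glue a fragment back onto the preceding sentence whenever it starts with a lowercase letter. The function name `merge_lowercase_fragments` is hypothetical, and the rule assumes that a genuine new sentence begins with an uppercase letter.

```python
def merge_lowercase_fragments(sentences):
    """Join any fragment that starts with a lowercase letter to the previous sentence."""
    merged = []
    for sent in sentences:
        # The first sentence is always kept, even if it starts in lowercase
        if merged and sent[:1].islower():
            merged[-1] = merged[-1] + ' ' + sent
        else:
            merged.append(sent)
    return merged


# Sentence list copied from the output for item 5
sentences = ["Река Колпь, вытекает из оз. Jokšar'v, течет в Вологодск.", 'обл.']
merged = merge_lowercase_fragments(sentences)
print(merged)
```

The heuristic would wrongly merge a real sentence that begins in lowercase, so it is best applied only after inspecting how the splitter behaves on the data at hand.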

6, Поле находится в 600-700 м на востоке