# NER
Named entity recognition (NER) extracts entities of the following types:

| Type Label  | Description                                          |
| ----------- | ---------------------------------------------------- |
| EVENT       | Named hurricanes, battles, wars, sports events, etc. |
| GPE         | Countries, cities, states.                           |
| LOC         | Non-GPE locations, mountain ranges, bodies of water. |
| PERSON      | People, including fictional.                         |
| NORP        | Nationalities or religious or political groups.      |
| FAC         | Buildings, airports, highways, bridges, etc.         |
| ORG         | Companies, agencies, institutions, etc.              |
| PRODUCT     | Objects, vehicles, foods, etc. (Not services.)       |
| WORK_OF_ART | Titles of books, songs, etc.                         |
| LAW         | Named documents made into laws.                      |
| LANGUAGE    | Any named language.                                  |
| QUANTITY    | Measurements, as of weight or distance.              |
| ORDINAL     | “first”, “second”, etc.                              |
| CARDINAL    | Numerals that do not fall under another type.        |
| MONEY       | Monetary values, including unit.                     |
| PERCENT     | Percentage, including "%".                           |
| DATE        | Absolute or relative dates or periods.               |
| TIME        | Times smaller than a day.                            |

> via https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/12-Named-Entity-Recognition.html


In [1]:
numeric_labels = ['QUANTITY', 'CARDINAL', 'ORDINAL', 'PERCENT', 'MONEY', 'TIME', 'DATE']
non_numeric_labels = ['EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'NORP', 'ORG', 'PERSON', 'PRODUCT', 'WORK_OF_ART']
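
The two label lists above can be used to partition any set of extracted entities. The following is a minimal sketch, assuming entities have already been reduced to plain `(label, text)` pairs; `split_by_label` is a hypothetical helper name and the sample pairs are written by hand for illustration.

```python
# Numeric label group, copied from the cell above.
NUMERIC_LABELS = {'QUANTITY', 'CARDINAL', 'ORDINAL', 'PERCENT', 'MONEY', 'TIME', 'DATE'}


def split_by_label(entities):
    """Split (label, text) pairs into (numeric, non_numeric) lists."""
    numeric, non_numeric = [], []
    for label, text in entities:
        target = numeric if label in NUMERIC_LABELS else non_numeric
        target.append((label, text))
    return numeric, non_numeric


# Hand-written sample pairs (assumption: not produced by a model here).
sample = [('ORG', 'Apple'), ('GPE', 'U.K.'), ('MONEY', '$1 billion')]
numeric, non_numeric = split_by_label(sample)
print(numeric)
print(non_numeric)
```

With spaCy, the same pairs could be built from `doc.ents` as `[(ent.label_, ent.text) for ent in doc.ents]`.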

In [18]:
!python3 -m pip install tensorflow
# %pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
%pip install sacremoses sentencepiece transformers
%pip install spacy
!python -m spacy download en_core_web_lg
!python -m spacy download es_core_news_lg

# Verify install:
import tensorflow as tf
print(tf.reduce_sum(tf.random.normal([1000, 1000])))

## spaCy

In [1]:
import spacy
from spacy import displacy
from spacy.lang.en.examples import sentences 


In [2]:
# Core models: vocabulary, syntax, entities, vectors
nlp = {
    'en': spacy.load("en_core_web_lg"),
    'es': spacy.load("es_core_news_lg")
}

### EN Sample

In [3]:
doc = nlp['en'](sentences[0])
for token in doc:
    print(token.pos_, token.dep_, token.text)

PROPN nsubj Apple
AUX aux is
VERB ROOT looking
ADP prep at
VERB pcomp buying
PROPN dobj U.K.
NOUN ccomp startup
ADP prep for
SYM quantmod $
NUM compound 1
NUM pobj billion


In [4]:
# doc.ents holds entity spans, not single tokens
for ent in doc.ents:
    print(ent.label_, ent.text)

ORG Apple
GPE U.K.
MONEY $1 billion


In [5]:
displacy.render(doc, style='dep', jupyter=True)

In [6]:
# displacy.serve(doc, style="ent")  # no need within a Jupyter notebook
displacy.render(doc, style='ent', jupyter=True)

### ES Spanish
+ spaCy's Spanish models do not extract quantitative entities (numbers, money, dates, ...), so we translate the text to English and run the English model on the translation to obtain them.

In [7]:
texts = [
    """Johann Sebastian Bach (Eisenach, Sacro Imperio Romano Germánico, 21 de marzo 1685 - Leipzig, Sacro Imperio Romano Germánico, 28 de julio de 1750) fue un compositor, músico, director de orquesta, maestro de capilla, cantor y profesor alemán del período barroco.
Fue el miembro más importante de una de las familias de músicos más destacadas de la historia, con más de 35 compositores famosos: la familia Bach. Tuvo una gran fama como organista y clavecinista en toda Europa por su gran técnica y capacidad de improvisar música al teclado. Además del órgano y del clavecín, tocaba el violín y la viola da gamba.""",
    "Me gasté 60€ ayer por la tarde"
]
text = texts[0]

In [9]:
# Helsinki - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) 
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")

In [8]:
# https://stackoverflow.com/questions/70043467/how-to-run-huggingface-helsinki-nlp-models
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-es-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-es-en")

input_ids = tokenizer(text, return_tensors="pt").input_ids
outputs = model.generate(input_ids=input_ids, num_beams=5, num_return_sequences=3)
print("Generated:", tokenizer.batch_decode(outputs, skip_special_tokens=True))



Generated: ['Johann Sebastian Bach (March 21, 1685 – July 28, 1750) was a German composer, musician, conductor, master of chapel, singer and teacher of the Baroque period. He was the most important member of one of the most outstanding families of musicians in history, with more than 35 famous composers: the Bach family. He had a great reputation as an organist and keyacinist throughout Europe for his great technique and ability to improvise music to the keyboard. In addition to the organ and the keyring, he played the violin and the viola da gamba.', 'Johann Sebastian Bach (March 21, 1685 – July 28, 1750) was a German composer, musician, conductor, master of chapel, singer and teacher of the Baroque period. He was the most important member of one of the most outstanding families of musicians in history, with more than 35 famous composers: the Bach family. He had a great reputation as an organist and keyacinist all over Europe for his great technique and ability to improvise music to t

[{'translation_text': 'I spent 60€ yesterday Johann Sebastian Bach (Eisenach, Holy Roman Empire Germanic, 21 March - Leipzig, Holy Roman Empire Germanic, 28 July 1750) was a composer, musician, conductor, master of the chapel, singer and German teacher of the Baroque period. He was the most important member of one of the most outstanding families of musicians in history, with more than 35 famous composers: the Bach family. He had a great reputation as organist and keyacinist throughout Europe for his great technique and ability to improvise music to the keyboard. In addition to the organ and the keyring, he played the violin and the viola da gamba.'}]

In [13]:
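def sp_quantitative_entities(text):
    """Collect entities from the Spanish text and from its English translation."""
    entities = []

    # Spanish model: non-numeric entities (e.g. PER, LOC).
    doc = nlp['es'](text)
    entities.extend(doc.ents)
    displacy.render(doc, style='ent', jupyter=True)

    # Translate to English so the English model can find numeric entities.
    result = translator(text)
    translation_text = result[0]['translation_text']
    print(result)

    doc = nlp['en'](translation_text)
    entities.extend(doc.ents)
    displacy.render(doc, style='ent', jupyter=True)

    # Both models report some of the same entities, so the result is redundant.
    return sorted(entities, key=lambda ent: ent.label_)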



In [11]:
text

'Johann Sebastian Bach (Eisenach, Sacro Imperio Romano Germánico, 21 de marzo 1685 - Leipzig, Sacro Imperio Romano Germánico, 28 de julio de 1750) fue un compositor, músico, director de orquesta, maestro de capilla, cantor y profesor alemán del período barroco.\nFue el miembro más importante de una de las familias de músicos más destacadas de la historia, con más de 35 compositores famosos: la familia Bach. Tuvo una gran fama como organista y clavecinista en toda Europa por su gran técnica y capacidad de improvisar música al teclado. Además del órgano y del clavecín, tocaba el violín y la viola da gamba.'

In [14]:
ents = sp_quantitative_entities(text)
for token in ents:
    print(token.label_, token)

[{'translation_text': 'Johann Sebastian Bach (March 21, 1685 – July 28, 1750) was a German composer, musician, conductor, master of chapel, singer and teacher of the Baroque period. He was the most important member of one of the most outstanding families of musicians in history, with more than 35 famous composers: the Bach family. He had a great reputation as an organist and keyacinist throughout Europe for his great technique and ability to improvise music to the keyboard. In addition to the organ and the keyring, he played the violin and the viola da gamba.'}]


CARDINAL one
CARDINAL more than 35
DATE March 21, 1685
DATE July 28, 1750
LOC Eisenach
LOC Sacro Imperio Romano
LOC Leipzig
LOC Sacro Imperio Romano
LOC Europa
LOC Europe
NORP German
ORG keyacinist
PER Johann Sebastian Bach
PER Bach
PERSON Johann Sebastian Bach
PERSON Bach
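
The merged output above lists several entities twice, because the Spanish and the English pipelines each report them, and the Spanish model labels people `PER` where the English one uses `PERSON`. The following is a minimal sketch of removing that redundancy, assuming the entities have been reduced to `(label, text)` pairs. `LABEL_ALIASES` and `dedupe_entities` are hypothetical names, and the sample pairs are copied by hand from the output above.

```python
# Assumption: treat the Spanish model's PER label as the English PERSON label.
LABEL_ALIASES = {'PER': 'PERSON'}


def dedupe_entities(pairs):
    """Drop repeated (label, text) pairs, keeping first-seen order."""
    seen = set()
    unique = []
    for label, text in pairs:
        label = LABEL_ALIASES.get(label, label)
        key = (label, text.strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append((label, text))
    return unique


pairs = [
    ('LOC', 'Sacro Imperio Romano'), ('LOC', 'Sacro Imperio Romano'),
    ('LOC', 'Europa'), ('LOC', 'Europe'),
    ('PER', 'Bach'), ('PERSON', 'Bach'),
]
result = dedupe_entities(pairs)
print(result)
```

Entities whose surface form changes in translation (`Europa` and `Europe`) still appear twice, since this sketch only compares text.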


In [15]:
ents = sp_quantitative_entities(texts[1])
for token in ents:
    print(token.label_, token)



[{'translation_text': 'I spent €60 yesterday afternoon'}]


DATE yesterday
MONEY 60
TIME afternoon


In [17]:
# numeric_labels
for token in doc.ents:
    if token.label_ in numeric_labels:
        print(token.label_, token)

MONEY 60€
DATE yesterday
DATE 21 March - Leipzig
DATE 28 July 1750
CARDINAL one
CARDINAL more than 35


## price-parser
https://github.com/scrapinghub/price-parser/

CONS:
+ extracts at most one value per expression

In [9]:
%pip install price-parser


In [35]:
from price_parser import Price
price = Price.fromstring("me debes 22,90 €. Y Juan 30€")
price

Price(amount=Decimal('22.90'), currency='€')

In [36]:
Price.fromstring(text)

Price(amount=Decimal('21'), currency='Lei')
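
Because `Price.fromstring` returns a single value, a sentence with several amounts loses all but the first. One possible workaround is a small regular expression that finds every amount. This is a sketch using only the standard library, not part of price-parser. It assumes amounts are written as digits with an optional decimal part followed by a currency symbol (`22,90 €`, `30€`); `AMOUNT_RE` and `find_prices` are hypothetical names.

```python
import re
from decimal import Decimal

# Assumption: "<digits>[,.<1-2 digits>]" then optional spaces and a currency symbol.
AMOUNT_RE = re.compile(r'(\d+(?:[.,]\d{1,2})?)\s*([€$£])')


def find_prices(text):
    """Return every (amount, currency) pair found in text."""
    return [
        (Decimal(number.replace(',', '.')), currency)
        for number, currency in AMOUNT_RE.findall(text)
    ]


prices = find_prices("me debes 22,90 €. Y Juan 30€")
print(prices)
```

The pattern does not handle thousands separators or symbols written before the amount (`$1 billion`), so it complements price-parser and does not replace it.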

## Dates
+ extraction
+ parsing (to datetime object)

MORE:

https://github.com/facebook/duckling
https://nlp.stanford.edu/software/sutime.shtml

### duckling
https://github.com/facebook/duckling

https://github.com/facebook/duckling/blob/main/LICENSE

CONS: 
+ parses dates incorrectly
+ written in Haskell (requires a Haskell toolchain to build)

In [34]:
# INSTALLATION FAILURE
# error situation:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/uninstall.sh)"
sudo rm -rf /opt/homebrew
sudo apt-get install ghc
sudo apt-get install libgmp10 libgmp-dev


!brew install pcre
sudo apt install cabal-install

!wget -qO- https://get.haskellstack.org/ | sh
stack upgrade
!git clone https://github.com/facebook/duckling.git
!cd duckling

https://apple.stackexchange.com/questions/373749/stack-install-fails-with-linker-errors-when-building-system-filepath-after-updat


Stack version 2.7.3 already appears to be installed at:
  /usr/local/bin/stack

Use 'stack upgrade' or your OS's package manager to upgrade,
or pass '-f' to this script to over-write the existing binary, e.g.:
  curl -sSL https://get.haskellstack.org/ | sh -s - -f

To install to a different location, pass '-d DESTDIR', e.g.:
  curl -sSL https://get.haskellstack.org/ | sh -s - -d /opt/stack/bin



### Dates
1. Date extraction from text
   + https://github.com/akoumjian/datefinder
      + MIT license: commercial use allowed
      + 😱 CONS:
         + does not handle Spanish

2. Date parsing
   + https://github.com/scrapinghub/dateparser
      + BSD 3-Clause "New" or "Revised" License: commercial use allowed

In [None]:
!pip install datefinder
!pip install dateparser

In [37]:
import datetime
import datefinder
import dateparser

In [38]:
string_with_dates = """
entries are due by January 4th, 2017 at 8:00pm
created 01/15/2005 by ACME Inc. and associates.
"""

matches = datefinder.find_dates(string_with_dates)
for match in matches:
    print(match)

2017-01-04 20:00:00
2005-01-15 00:00:00


In [40]:
text

'Johann Sebastian Bach (Eisenach, Sacro Imperio Romano Germánico, 21 de marzo 1685 - Leipzig, Sacro Imperio Romano Germánico, 28 de julio de 1750) fue un compositor, músico, director de orquesta, maestro de capilla, cantor y profesor alemán del período barroco.\nFue el miembro más importante de una de las familias de músicos más destacadas de la historia, con más de 35 compositores famosos: la familia Bach. Tuvo una gran fama como organista y clavecinista en toda Europa por su gran técnica y capacidad de improvisar música al teclado. Además del órgano y del clavecín, tocaba el violín y la viola da gamba.'

In [39]:
matches = datefinder.find_dates(text)
for match in matches:
    print(match)

2023-03-21 00:00:00
2023-03-28 00:00:00
2035-03-21 00:00:00


In [44]:
dateparser.parse('tomorrow', settings={'RELATIVE_BASE': datetime.datetime(1992, 1, 1)})


datetime.datetime(1992, 1, 2, 0, 0)

In [45]:
date_tests = [
    'Martes 21 de Octubre de 2014',
    'Mañana',
    'A las 11 de la noche',
]

In [46]:
for date_test in date_tests:
    result = dateparser.parse(date_test)
    print(result)

2014-10-21 00:00:00
2023-03-22 20:50:20.249581
None


In [48]:
# Requires date extraction before parsing
dateparser.parse('Nos vemos el martes 21 de Octubre de 2014')
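
Since datefinder misreads the Spanish text and dateparser expects an isolated date expression, an extraction step for Spanish is still missing. The following is a minimal sketch of one with the standard library, assuming dates are written as `<day> de <month> [de] <year>`. `MONTHS_ES`, `DATE_ES_RE` and `find_spanish_dates` are hypothetical names, and only this one pattern is covered.

```python
import re
from datetime import date

MONTHS_ES = {
    'enero': 1, 'febrero': 2, 'marzo': 3, 'abril': 4, 'mayo': 5, 'junio': 6,
    'julio': 7, 'agosto': 8, 'septiembre': 9, 'octubre': 10,
    'noviembre': 11, 'diciembre': 12,
}

# Matches "21 de Octubre de 2014" and also "21 de marzo 1685" (second "de" optional).
DATE_ES_RE = re.compile(
    r'\b(\d{1,2}) de (' + '|'.join(MONTHS_ES) + r')(?: de)? (\d{4})\b',
    re.IGNORECASE,
)


def find_spanish_dates(text):
    """Return a datetime.date for every matching expression in text."""
    return [
        date(int(year), MONTHS_ES[month.lower()], int(day))
        for day, month, year in DATE_ES_RE.findall(text)
    ]


print(find_spanish_dates('Nos vemos el martes 21 de Octubre de 2014'))
```

An impossible day such as `31 de febrero` raises `ValueError` in this sketch. Relative expressions (`Mañana`) are out of scope here; the matched substring could be handed to `dateparser.parse` instead of being converted by hand.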

### number-parser
https://github.com/scrapinghub/number-parser/

In [None]:
!pip install number-parser

In [42]:
from number_parser import parse
tests = [
    "I have two hats and thirty seven coats",
    "One, Two, Three go",
    "First day of year two thousand",
    "mi número es el seis seis cuatro dos uno nueve dos cuatro uno"
]

In [43]:
for test in tests:
    print(parse(test))

I have 2 hats and 37 coats
1, 2, 3 go
1 day of year 2000
mi número es el 6 6 4 2 1 9 2 4 1
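
In the last result the spelled-out digits are converted one by one and stay separated by spaces. If the sequence is meant as a single number, such as a phone number, the digits can be joined afterwards. This is a small standard-library sketch under that assumption; `DIGIT_RUN_RE` and `join_digit_runs` are hypothetical names.

```python
import re

# Runs of two or more single digits separated by single spaces.
DIGIT_RUN_RE = re.compile(r'\b\d(?: \d)+\b')


def join_digit_runs(text):
    """Collapse sequences like '6 6 4' into '664'."""
    return DIGIT_RUN_RE.sub(lambda m: m.group(0).replace(' ', ''), text)


print(join_digit_runs("mi número es el 6 6 4 2 1 9 2 4 1"))
```

The rule is deliberately naive: it would also merge unrelated neighbouring digits (`2 3 apples`), so it only fits text where isolated digits are known to belong together.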
