# NER
+ Named entity recognition (NER) extracts entities of the following types:

| Type Label  | Description                                          |
| ----------- | ---------------------------------------------------- |
| EVENT       | Named hurricanes, battles, wars, sports events, etc. |
| GPE         | Countries, cities, states.                           |
| LOC         | Non-GPE locations, mountain ranges, bodies of water. |
| PERSON      | People, including fictional.                         |
| NORP        | Nationalities or religious or political groups.      |
| FAC         | Buildings, airports, highways, bridges, etc.         |
| ORG         | Companies, agencies, institutions, etc.              |
| PRODUCT     | Objects, vehicles, foods, etc. (Not services.)       |
| WORK_OF_ART | Titles of books, songs, etc.                         |
| LAW         | Named documents made into laws.                      |
| LANGUAGE    | Any named language.                                  |
| QUANTITY    | Measurements, as of weight or distance.              |
| ORDINAL     | “first”, “second”, etc.                              |
| CARDINAL    | Numerals that do not fall under another type.        |
| MONEY       | Monetary values, including unit.                     |
| PERCENT     | Percentage, including “%”.                           |
| DATE        | Absolute or relative dates or periods.               |
| TIME        | Times smaller than a day.                            |

> via https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/12-Named-Entity-Recognition.html


In [1]:
numeric_labels = ['QUANTITY', 'CARDINAL', 'ORDINAL', 'PERCENT', 'MONEY', 'TIME', 'DATE']
non_numeric_labels = ['EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'NORP', 'ORG', 'PERSON', 'PRODUCT', 'WORK_OF_ART']
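The two label lists above can be used to split recognized entities into a numeric and a non-numeric group. The following is a minimal sketch: `split_entities` and the sample pairs are hypothetical names introduced for illustration, and the `(label, text)` pairs stand in for what spaCy would produce from `doc.ents`.

```python
NUMERIC_LABELS = {"QUANTITY", "CARDINAL", "ORDINAL", "PERCENT", "MONEY", "TIME", "DATE"}


def split_entities(entities, numeric_labels=NUMERIC_LABELS):
    """Split (label, text) pairs into a numeric and a non-numeric list."""
    numeric, other = [], []
    for label, text in entities:
        if label in numeric_labels:
            numeric.append((label, text))
        else:
            other.append((label, text))
    return numeric, other


# With spaCy the pairs would come from: [(ent.label_, ent.text) for ent in doc.ents]
sample = [("ORG", "Apple"), ("GPE", "U.K."), ("MONEY", "$1 billion")]
numeric, other = split_entities(sample)
print(numeric)
print(other)
```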

In [2]:
!pip install sacremoses
!pip install spacy
!python -m spacy download en_core_web_lg

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


## spaCy

### EN English

In [3]:
import spacy
from spacy import displacy
from spacy.lang.en.examples import sentences 

In [4]:
nlp = spacy.load("en_core_web_lg")
doc = nlp(sentences[0])

In [5]:
for token in doc:
    print(token.pos_, token.dep_, token.text)

PROPN nsubj Apple
AUX aux is
VERB ROOT looking
ADP prep at
VERB pcomp buying
PROPN dobj U.K.
NOUN ccomp startup
ADP prep for
SYM quantmod $
NUM compound 1
NUM pobj billion


In [6]:
for ent in doc.ents:
    print(ent.label_, ent)

ORG Apple
GPE U.K.
MONEY $1 billion


In [7]:
displacy.render(doc, style='dep', jupyter=True)

In [8]:
# displacy.serve(doc, style="ent")  # no need within a Jupyter notebook
displacy.render(doc, style='ent', jupyter=True)

### ES Spanish
+ spaCy's Spanish models do not extract quantitative entities (numbers, money, dates, ...), so we translate the text to English first and run the English model on the translation.
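The translate-then-recognize idea can be sketched as a small function that takes the translator and the NER model as arguments. This is only an illustration under stated assumptions: `numeric_entities_via_translation`, `fake_translate` and `fake_nlp` are hypothetical names, and the stand-ins below only mimic the shape of a translation call and of a spaCy `Doc` (an `.ents` iterable whose items have `.label_` and `.text`).

```python
from types import SimpleNamespace

NUMERIC_LABELS = {"QUANTITY", "CARDINAL", "ORDINAL", "PERCENT", "MONEY", "TIME", "DATE"}


def numeric_entities_via_translation(text, translate, nlp, numeric_labels=NUMERIC_LABELS):
    """Translate `text` to English, run NER on it, keep only numeric entity types.

    translate: callable str -> str (hypothetical wrapper around a translation model)
    nlp: callable str -> object exposing `.ents`, as a spaCy pipeline does
    """
    english = translate(text)
    doc = nlp(english)
    return [(ent.label_, ent.text) for ent in doc.ents if ent.label_ in numeric_labels]


# Stand-ins so the sketch runs without downloading any model (not real model output).
def fake_translate(text):
    return "I spent 60€ yesterday with Juan"


def fake_nlp(text):
    ents = [
        SimpleNamespace(label_="MONEY", text="60€"),
        SimpleNamespace(label_="DATE", text="yesterday"),
        SimpleNamespace(label_="PERSON", text="Juan"),
    ]
    return SimpleNamespace(ents=ents)


result = numeric_entities_via_translation("Me gasté 60€ ayer con Juan", fake_translate, fake_nlp)
print(result)
```

With the objects used in this notebook, `translate` could be `lambda t: translator(t)[0]['translation_text']` and `nlp` the loaded `en_core_web_lg` pipeline.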

In [27]:
text = """
Me gasté 60€ ayer
Johann Sebastian Bach (Eisenach, Sacro Imperio Romano Germánico, 21 de marzo 1685 - Leipzig, Sacro Imperio Romano Germánico, 28 de julio de 1750) fue un compositor, músico, director de orquesta, maestro de capilla, cantor y profesor alemán del período barroco.
Fue el miembro más importante de una de las familias de músicos más destacadas de la historia, con más de 35 compositores famosos: la familia Bach. Tuvo una gran fama como organista y clavecinista en toda Europa por su gran técnica y capacidad de improvisar música al teclado. Además del órgano y del clavecín, tocaba el violín y la viola da gamba. 
"""

In [9]:
# Helsinki - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

# Spanish -> English: tokenizer and model must come from the same checkpoint
# (the original cell loaded the en-es tokenizer and then overwrote the en-es model with es-en)
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-es-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-es-en")

In [11]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")

In [12]:
result = translator(text)
result

[{'translation_text': 'I spent 60€ yesterday Johann Sebastian Bach (Eisenach, Holy Roman Empire Germanic, 21 March - Leipzig, Holy Roman Empire Germanic, 28 July 1750) was a composer, musician, conductor, master of the chapel, singer and German teacher of the Baroque period. He was the most important member of one of the most outstanding families of musicians in history, with more than 35 famous composers: the Bach family. He had a great reputation as organist and keyacinist throughout Europe for his great technique and ability to improvise music to the keyboard. In addition to the organ and the keyring, he played the violin and the viola da gamba.'}]

In [13]:
translation_text = result[0]['translation_text']

In [14]:
doc = nlp(translation_text)

In [15]:
# displacy.serve(doc, style="ent")  # no need within a Jupyter notebook
displacy.render(doc, style='ent', jupyter=True)

In [16]:
for ent in doc.ents:
    print(ent.label_, ent)

MONEY 60€
DATE yesterday
PERSON Johann Sebastian Bach
GPE Eisenach
GPE Holy Roman Empire Germanic
DATE 21 March - Leipzig
GPE Holy Roman Empire Germanic
DATE 28 July 1750
NORP German
CARDINAL one
CARDINAL more than 35
PERSON Bach
ORG keyacinist
LOC Europe


In [17]:
# numeric_labels
for ent in doc.ents:
    if ent.label_ in numeric_labels:
        print(ent.label_, ent)

MONEY 60€
DATE yesterday
DATE 21 March - Leipzig
DATE 28 July 1750
CARDINAL one
CARDINAL more than 35


## price-parser
https://github.com/scrapinghub/price-parser/

CONS:
+ extracts at most one value per expression

In [9]:
!pip install price-parser


In [13]:
from price_parser import Price
price = Price.fromstring("me debes 22,90 €. Y Juan 30€")
price

Price(amount=Decimal('22.90'), currency='€')

In [33]:
Price.fromstring(text)

Price(amount=Decimal('60'), currency='€')
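Since only the first price of each string is returned, one possible workaround is to locate every amount first. The sketch below does this with a plain regular expression and does not use price-parser at all; `PRICE_RE` and `find_all_prices` are hypothetical names, and the pattern assumes amounts written as digits, an optional one- or two-digit decimal part, and a trailing currency symbol (thousands separators and leading symbols are not handled).

```python
import re
from decimal import Decimal

# Assumption: amounts look like "22,90 €" or "30€".
PRICE_RE = re.compile(r"(\d+(?:[.,]\d{1,2})?)\s*([€$£])")


def find_all_prices(text):
    """Return every (amount, currency) pair found in `text`, in order of appearance."""
    return [
        (Decimal(amount.replace(",", ".")), currency)
        for amount, currency in PRICE_RE.findall(text)
    ]


prices = find_all_prices("me debes 22,90 €. Y Juan 30€")
print(prices)
```

Each matched substring could also be passed to `Price.fromstring` individually if price-parser's currency handling is preferred over the hard-coded symbol set.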

## Dates
+ extraction
+ parsing (to datetime object)
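The two steps listed above (find the date expression, then convert it) can be illustrated with the standard library alone. This is a rough sketch for one Spanish pattern only ("21 de marzo 1685", "28 de julio de 1750"); `MONTHS_ES`, `DATE_RE` and `extract_dates_es` are hypothetical names and not part of any library discussed here.

```python
import re
from datetime import date

MONTHS_ES = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4, "mayo": 5, "junio": 6,
    "julio": 7, "agosto": 8, "septiembre": 9, "setiembre": 9, "octubre": 10,
    "noviembre": 11, "diciembre": 12,
}

# Step 1 (extraction): day, "de", month name, optional "de", four-digit year.
DATE_RE = re.compile(r"\b(\d{1,2}) de ([a-z]+)(?: de)? (\d{4})\b", re.IGNORECASE)


def extract_dates_es(text):
    """Return (matched text, date) pairs for every date expression found."""
    found = []
    for match in DATE_RE.finditer(text):
        day, month_name, year = match.groups()
        month = MONTHS_ES.get(month_name.lower())
        if month is None:
            continue  # the word after "de" is not a month name
        try:
            # Step 2 (parsing): build a date object from the captured pieces.
            found.append((match.group(0), date(int(year), month, int(day))))
        except ValueError:
            continue  # impossible calendar date such as "31 de febrero"
    return found


sample = "Eisenach, 21 de marzo 1685 - Leipzig, 28 de julio de 1750"
dates = extract_dates_es(sample)
print(dates)
```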

MORE:

+ https://github.com/facebook/duckling
+ https://nlp.stanford.edu/software/sutime.shtml

### duckling
https://github.com/facebook/duckling

https://github.com/facebook/duckling/blob/main/LICENSE

CONS:
+ parses dates poorly
+ written in Haskell

In [34]:
# INSTALLATION FAILURE
# error situation:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/uninstall.sh)"
sudo rm -rf /opt/homebrew
sudo apt-get install ghc
sudo apt-get install libgmp10 libgmp-dev


!brew install pcre
sudo apt install cabal-install

!wget -qO- https://get.haskellstack.org/ | sh
stack upgrade
!git clone https://github.com/facebook/duckling.git
%cd duckling  # "!cd" only changes directory inside a throwaway subshell

https://apple.stackexchange.com/questions/373749/stack-install-fails-with-linker-errors-when-building-system-filepath-after-updat


Stack version 2.7.3 already appears to be installed at:
  /usr/local/bin/stack

Use 'stack upgrade' or your OS's package manager to upgrade,
or pass '-f' to this script to over-write the existing binary, e.g.:
  curl -sSL https://get.haskellstack.org/ | sh -s - -f

To install to a different location, pass '-d DESTDIR', e.g.:
  curl -sSL https://get.haskellstack.org/ | sh -s - -d /opt/stack/bin


### datefinder - extract dates from text
https://github.com/akoumjian/datefinder

MIT license: commercial use allowed

😱 CONS:
+ does not handle Spanish dates correctly

In [23]:
!pip install datefinder

Collecting datefinder
  Downloading datefinder-0.7.3-py2.py3-none-any.whl (10 kB)
Installing collected packages: datefinder
Successfully installed datefinder-0.7.3


In [28]:
import datefinder

string_with_dates = """
entries are due by January 4th, 2017 at 8:00pm
created 01/15/2005 by ACME Inc. and associates.
"""

matches = datefinder.find_dates(string_with_dates)
for match in matches:
    print(match)


2017-01-04 20:00:00
2005-01-15 00:00:00


In [31]:
matches = datefinder.find_dates(text)
for match in matches:
    print(match)

2023-03-21 00:00:00
2023-03-28 00:00:00
2035-03-20 00:00:00


### number-parser
https://github.com/scrapinghub/number-parser/

In [14]:
!pip install number-parser

Collecting number-parser
  Downloading number_parser-0.3.0-py2.py3-none-any.whl (50 kB)
Installing collected packages: number-parser
Successfully installed number-parser-0.3.0


In [29]:
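from number_parser import parse
# A notebook cell displays only the value of its last expression,
# so only the result of the third call appears below.
parse("I have two hats and thirty seven coats")
parse("One, Two, Three go")
parse("First day of year two thousand")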

'1 day of year 2000'

In [30]:
parse("mi número es el seis seis cuatro dos uno nueve dos cuatro uno")

'mi número es el 6 6 4 2 1 9 2 4 1'
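In the output above each spoken digit becomes a separate number. If the digits are meant to form one number (a phone number, for instance), they can be joined afterwards. The helper below is a hypothetical post-processing sketch using only `re`; `join_digit_runs` and its `min_digits` threshold are illustrative choices, not part of number-parser.

```python
import re

# A run of single digits separated by single spaces, e.g. "6 6 4 2".
DIGIT_RUN_RE = re.compile(r"\b\d(?: \d)+\b")


def join_digit_runs(text, min_digits=3):
    """Join runs of space-separated single digits into one digit string."""

    def _join(match):
        digits = match.group(0).replace(" ", "")
        # Leave short runs alone: "2 3" is more likely two separate numbers.
        return digits if len(digits) >= min_digits else match.group(0)

    return DIGIT_RUN_RE.sub(_join, text)


joined = join_digit_runs("mi número es el 6 6 4 2 1 9 2 4 1")
print(joined)
```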

### dateparser
https://github.com/scrapinghub/dateparser

BSD 3-Clause "New" or "Revised" License. Permissions:
+ Commercial use
+ Modification
+ Distribution
+ Private use 

You may also like...

- [price-parser](https://github.com/scrapinghub/price-parser/) - A small library for extracting price and currency from raw text strings.
- [number-parser](https://github.com/scrapinghub/number-parser/) - Library to convert numbers written in natural language to their equivalent numeric forms.

In [2]:
!pip install dateparser

Collecting dateparser
  Downloading dateparser-1.1.7-py2.py3-none-any.whl (293 kB)
Collecting tzlocal
  Downloading tzlocal-4.3-py3-none-any.whl (20 kB)
Collecting backports.zoneinfo
  Downloading backports.zoneinfo-0.2.1-cp38-cp38-manylinux1_x86_64.whl (74 kB)
Collecting pytz-deprecation-shim
  Using cached pytz_deprecation_shim-0.1.0.post0-py2.py3-none-any.whl (15 kB)
Collecting tzdata
  Using cached tzdata-2022.7-py2.py3-none-any.whl (340 kB)
Installing collected packages: tzdata, backports.zoneinfo, pytz-deprecation-shim, tzlocal, dateparser
Successfully installed backports.zoneinfo-0.2.1 dateparser-1.1.7 pytz-deprecation-shim-0.1.0.post0 tzdata-2022.7 tzlocal-4.3

In [3]:
import dateparser

In [None]:
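import datetime

# RELATIVE_BASE sets the reference date that relative expressions such as 'tomorrow' are resolved against
dateparser.parse('tomorrow', settings={'RELATIVE_BASE': datetime.datetime(1992, 1, 1)})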

In [6]:
date_tests = [
'Martes 21 de Octubre de 2014',
'Mañana',
'A las 11 de la noche',
]

In [7]:
for date_test in date_tests:
    result = dateparser.parse(date_test)
    print(result)
# expected for the first entry: datetime.datetime(2014, 10, 21, 0, 0)

2014-10-21 00:00:00
2023-03-21 20:34:19.378303
None


In [5]:
# Requires extraction before parsing: the date has to be isolated from the surrounding sentence first
dateparser.parse('Nos vemos el martes 21 de Octubre de 2014')
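For the extraction step, dateparser ships a `search_dates` function in `dateparser.search` that looks for date expressions inside longer text and returns `(substring, datetime)` pairs, or `None` when nothing is found. The wrapper below is a hedged sketch: `find_dates_in_sentence` and its `search` parameter are hypothetical, the latter being an injection point so the function can be tried without dateparser installed. How well the search performs on the Spanish sentences in this notebook has not been checked here.

```python
def find_dates_in_sentence(sentence, languages=("es",), search=None):
    """Return (matched substring, datetime) pairs found inside `sentence`."""
    if search is None:
        # Imported lazily; requires `pip install dateparser`.
        from dateparser.search import search_dates as search
    # search_dates returns None when it finds nothing; normalize to an empty list.
    return search(sentence, languages=list(languages)) or []
```

A call such as `find_dates_in_sentence('Nos vemos el martes 21 de Octubre de 2014')` would then delegate to `search_dates` with `languages=['es']`.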