<a href="https://colab.research.google.com/github/Viny2030/UNED/blob/main/practica02_spaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hashing in Python
A hash function takes an input of variable length and returns a fixed-length sequence with some useful properties. For example, for the input string El Libro De Python, SHA-256 returns the 64-character hexadecimal digest computed later in this notebook.

## Applications of hash functions
Hash functions are used in many different fields. The most relevant use cases are explained below:

Information integrity: Hash functions let us check that a piece of digital content has not been modified. If we compute the hash of a video or a book and store it, we have a "digital fingerprint" of that content. If we later receive the same video or book, we can compute its hash again and compare it with the stored one. This saves us from comparing both files frame by frame or page by page.
Generating random numbers: Hash functions can be used to generate random numbers or, more precisely, pseudorandom numbers.
Digital signatures: A digital signature usually signs only the hash of the message rather than the whole content, which is more efficient and reduces certain attack vectors.
Merkle trees: Merkle trees can also be used to summarize information: the data is split into small chunks and hashes are computed recursively until a single hash, called the Merkle root, is obtained. They are widely used in blockchains.
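
The Merkle tree idea can be sketched with hashlib. This is a minimal illustration, not the exact scheme of any particular blockchain: the helper names `sha256` and `merkle_root` are hypothetical, and duplicating the last digest when a level has an odd number of entries is an assumption made for this sketch.

```python
import hashlib


def sha256(data):
    """Return the raw 32-byte SHA-256 digest of data."""
    return hashlib.sha256(data).digest()


def merkle_root(chunks):
    """Hash every chunk, then hash adjacent pairs until one digest is left."""
    if not chunks:
        raise ValueError("at least one chunk is required")
    level = [sha256(chunk) for chunk in chunks]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumption: repeat the last digest so every digest has a partner.
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


root = merkle_root([b"block 1", b"block 2", b"block 3"])
tampered = merkle_root([b"block 1", b"block X", b"block 3"])
print(root.hex())
print(root != tampered)  # changing any chunk changes the root
```

Changing a single chunk changes the root, so one 32-byte value is enough to detect a modification anywhere in the data.
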
## Types of hash functions
There are many hash functions, each with its own use cases. Two of their most important characteristics are the output length and the algorithm they use:

BLAKE: Has variants such as BLAKE2 and BLAKE3, the latter announced in 2020. Further variants exist depending on the number of output bits.
MD: Has several variants, such as MD2, MD4, MD5 and MD6. MD5 was introduced in RFC 1321 and is still widely used for checksums, although it is no longer collision resistant and should not be used for security purposes.
SHA: Has variants such as SHA256, SHA512, SHA224 and SHA384. SHA256 is the one used by the Bitcoin cryptocurrency.
KECCAK-256: Used by the Ethereum cryptocurrency.
## Hash functions in Python
Python's hashlib library provides many of the most common hash functions. The cells below use sha3_256, sha224 and sha256.

In [1]:
import hashlib

# Create the hash object
hash_object = hashlib.sha3_256()  # SHA3-256 (FIPS 202); not identical to the Keccak-256 variant used by Ethereum, which pads the input differently

# Update the hash object with the data
hash_object.update(b"Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps.")

# Get the binary digest
salidae = hash_object.digest()  # digest() returns raw bytes; hexdigest() returns a hex string

# Print the digest
print(salidae)

b'Y\xe4\xfc,\xb0\xeb\xf9\xe1\x8d_\x16\x19CrA\x1c\xe9q\xe8\xc2\xc4iz\xae\xfa\xfc^g\x00\xb9\x9e\t'


In [2]:
len(salidae)

32

In [3]:
import hashlib

# Create the hash object
hash_object = hashlib.sha224()

# Update the hash object with the data
hash_object.update(b"Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps.")

# Get the binary digest
salidas = hash_object.digest()  # digest() returns raw bytes; hexdigest() returns a hex string

# Print the digest
print(salidas)

b'.l\xa5A\xd6\x89f\xe8{K\x15[zSi7[]\xdc\xc1\xdc\xb3\xe5\xdc"\x91E5'


In [4]:
len(salidas)

28

In [5]:
import hashlib

# Create the hash object
hash_object = hashlib.sha256()  # SHA256 is the hash used by the Bitcoin cryptocurrency

# Update the hash object with the data
hash_object.update(b"Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps.")

# Get the binary digest
salidab = hash_object.digest()  # digest() returns raw bytes; hexdigest() returns a hex string

# Print the digest
print(salidab)

b'(\xdf\x82fz\x1d\x11u\xaeJy\x83Q\xf7/\rk\x9ed\\\xd7\xc2m\xc1\xf0\x14\x0f*\xdc\xd4\x0e\xa9'


In [6]:
len(salidab)

32

In [7]:
import hashlib

salida = hashlib.sha256(b"Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps.").hexdigest()
print(salida)

28df82667a1d1175ae4a798351f72f0d6b9e645cd7c26dc1f0140f2adcd40ea9


In [8]:
len(salida)

64

There are different hash functions; in the previous case we used sha256. A hash function can be seen as a digest (summary) function, since it lets us "summarize" data of variable length into a fixed-length (and relatively short) sequence. We could also feed the entire text of Don Quixote to sha256, and the output would still be 32 bytes long.

We can also read digest_size, the length of the output in bytes. It is a fixed value for each hash function; for sha256 it is 32 bytes or, equivalently, 256 bits, which is where the name comes from.

In [9]:
import hashlib

m = hashlib.sha256()
m.update(b"El Libro De Python")
salida = m.hexdigest()

print(salida)

f7b5c532807800c540f5e4476ea1f6d968294fc34c90f2e7e64435ea3c054ce6


In [10]:
print(m.digest_size)

32
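
The relation between digest_size, the raw digest and the hexadecimal digest can be checked directly. This is a small sketch using only hashlib; the variable names `raw`, `hex_text` and `bits` are chosen here for illustration.

```python
import hashlib

m = hashlib.sha256(b"El Libro De Python")
raw = m.digest()          # raw bytes
hex_text = m.hexdigest()  # same value, written as hexadecimal text

bits = m.digest_size * 8
print(m.digest_size, bits)             # 32 256
print(len(raw), len(hex_text))         # 32 64
print(bytes.fromhex(hex_text) == raw)  # True
```

Each byte is written as two hexadecimal characters, which is why the hexdigest of sha256 has 64 characters while the raw digest has 32 bytes.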


We can also hash several inputs by calling update() repeatedly; the result is the same as hashing the concatenation of all the inputs:

In [11]:
import hashlib

m = hashlib.sha256()
m.update(b"Secuencia 1")
m.update(b"Secuencia 2")
m.update(b"Secuencia 3")
salida = m.hexdigest()

print(salida)

8bfe6a71680cdf4cc4c4024b3808f0d865a729c58695b145e7408555c24aab29
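
Repeated update() calls are what make it possible to hash large content piece by piece. The sketch below checks that incremental hashing matches hashing the concatenated bytes, and applies the same idea to a stream read in chunks; the helper `sha256_of_stream` and the tiny chunk size are assumptions chosen for illustration.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size=4096):
    """Hash a binary stream chunk by chunk without loading it all at once."""
    h = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        h.update(chunk)
    return h.hexdigest()


parts = (b"Secuencia 1", b"Secuencia 2", b"Secuencia 3")

incremental = hashlib.sha256()
for part in parts:
    incremental.update(part)

joined = hashlib.sha256(b"".join(parts)).hexdigest()
print(incremental.hexdigest() == joined)  # True

streamed = sha256_of_stream(io.BytesIO(b"".join(parts)), chunk_size=5)
print(streamed == joined)  # True
```

The same function could be given a file opened in binary mode, which is how a stored "fingerprint" of a large video or book can be recomputed and compared later.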


Below are examples for several hash functions, all using the same input. The outputs differ, and their lengths vary from one function to another.

In [12]:
import hashlib

print(hashlib.sha256(b"El Libro de Python").hexdigest())
print(hashlib.sha224(b"El Libro de Python").hexdigest())
print(hashlib.sha512(b"El Libro de Python").hexdigest())
print(hashlib.blake2b(b"El Libro de Python").hexdigest())
print(hashlib.blake2s(b"El Libro de Python").hexdigest())
print(hashlib.md5(b"El Libro de Python").hexdigest())

3f0eb88c12b73f8235f3bc5a19336d32a41bbe6743291c97e8fb00ea2d3520e0
a46e8c9207522d305f79b818f37170c8b3094acb607ed3645854ea38
01a489b59eb18cc2d297e3be6918ba1cff2875514900ce781a95c006a05eac68c9dffc0ae9bb64c00dbc628fa1b2a28159e8cee2875e86157d82c0998a786beb
2728dcb655d2572bff0f0ecad1518ddc005a7896ea18e0f47c07cf54efcb639266c3056ebd4ef62abad34d8ccd074925012cde3da4a351f0f14831b7c36f6a48
cae14ea7acc2827b66212f3da5ea35d8d65fe248fc00030ddb9f9e5113f120a7
77dd74c5fa2c54e26cc67b5ab47b38fd
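
The comparison above can also be written as a loop with hashlib.new, which selects the algorithm by name. This is a short sketch; the `sizes` dictionary is only there to collect the output lengths in one place.

```python
import hashlib

data = b"El Libro de Python"
names = ("sha224", "sha256", "sha512", "blake2b", "blake2s", "md5")

sizes = {}
for name in names:
    h = hashlib.new(name, data)
    sizes[name] = h.digest_size
    print(f"{name:8} {h.digest_size:3d} bytes  {len(h.hexdigest()):3d} hex characters")
```

The set hashlib.algorithms_guaranteed lists the algorithm names available on every platform.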


In [82]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I love coffee")
print(doc.vocab.strings["coffee"])  # 3197928453018144401
print(doc.vocab.strings[3197928453018144401])  # 'coffee'




3197928453018144401
coffee


In [83]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I love coffee")
for word in doc:
    lexeme = doc.vocab[word.text]
    print(lexeme.text, lexeme.orth, lexeme.shape_, lexeme.prefix_, lexeme.suffix_,
            lexeme.is_alpha, lexeme.is_digit, lexeme.is_title, lexeme.lang_)

I 4690420944186131903 X I I True False True en
love 3702023516439754181 xxxx l ove True False False en
coffee 3197928453018144401 xxxx c fee True False False en


https://www.youtube.com/watch?v=pLJm0WSIVDk&list=PLc2rvfiptPSSS-iwKS_lxI3MZr8Mbi4Zu&index=1


In [13]:
!pip install -U pip setuptools wheel
!pip install -U spacy



## en_core_web_sm: small English pipeline
## en_core_web_md: medium English pipeline
## en_core_web_lg: large English pipeline
## en_core_web_trf: transformer-based English pipeline


In [14]:
!python -m spacy download en_core_web_trf


Collecting en-core-web-trf==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.8.0/en_core_web_trf-3.8.0-py3-none-any.whl (457.4 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_trf')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


https://kgptalkie.com/


## Linguistic annotations

spaCy annotates each token with linguistic information, such as its part of speech and its grammatical role within the sentence.

In [15]:
import spacy
nlp= spacy.load("en_core_web_trf")
doc = nlp("Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps.")
for token in doc:
  print(token.text, "//", token.pos_, "//", token.dep_)  # token text, part-of-speech tag, dependency label


Jane // PROPN // nsubj
bought // VERB // ROOT
me // PRON // dative
these // DET // det
books // NOUN // dobj
. // PUNCT // punct
Jane // PROPN // nsubj
bought // VERB // ROOT
a // DET // det
book // NOUN // dobj
for // ADP // dative
me // PRON // pobj
. // PUNCT // punct
She // PRON // nsubj
dropped // VERB // ROOT
a // DET // det
line // NOUN // dobj
to // ADP // prep
him // PRON // pobj
. // PUNCT // punct
Thank // VERB // ROOT
you // PRON // dobj
. // PUNCT // punct
She // PRON // nsubj
sleeps // VERB // ROOT
. // PUNCT // punct
I // PRON // nsubj
sleep // VERB // ROOT
a // DET // det
lot // NOUN // npadvmod
. // PUNCT // punct
I // PRON // nsubjpass
was // AUX // auxpass
born // VERB // ROOT
in // ADP // prep
Madrid.the // PROPN // punct
cat // NOUN // nsubjpass
was // AUX // auxpass
chased // VERB // ROOT
by // ADP // agent
the // DET // det
dog // NOUN // pobj
. // PUNCT // punct
I // PRON // nsubjpass
was // AUX // auxpass
born // VERB // ROOT
in // ADP // prep
Madrid // PROPN /

In [16]:
for token in doc:
  print(token.text)  # token text only

Jane
bought
me
these
books
.
Jane
bought
a
book
for
me
.
She
dropped
a
line
to
him
.
Thank
you
.
She
sleeps
.
I
sleep
a
lot
.
I
was
born
in
Madrid.the
cat
was
chased
by
the
dog
.
I
was
born
in
Madrid
during
1995.Out
of
all
this
,
something
good
will
come
.
Susan
left
after
the
rehearsal
.
She
did
it
well
.
She
sleeps
during
the
morning
,
but
she
sleeps
.


In [17]:
for token in doc:
  print(token.pos_)  # coarse-grained part-of-speech tag only

PROPN
VERB
PRON
DET
NOUN
PUNCT
PROPN
VERB
DET
NOUN
ADP
PRON
PUNCT
PRON
VERB
DET
NOUN
ADP
PRON
PUNCT
VERB
PRON
PUNCT
PRON
VERB
PUNCT
PRON
VERB
DET
NOUN
PUNCT
PRON
AUX
VERB
ADP
PROPN
NOUN
AUX
VERB
ADP
DET
NOUN
PUNCT
PRON
AUX
VERB
ADP
PROPN
ADP
NUM
ADP
DET
PRON
PUNCT
PRON
ADJ
AUX
VERB
PUNCT
PROPN
VERB
ADP
DET
NOUN
PUNCT
PRON
VERB
PRON
ADV
PUNCT
PRON
VERB
ADP
DET
NOUN
PUNCT
CCONJ
PRON
VERB
PUNCT


In [84]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps.")
for word in doc:
    lexeme = doc.vocab[word.text]
    print(lexeme.text, lexeme.orth, lexeme.shape_, lexeme.prefix_, lexeme.suffix_,
            lexeme.is_alpha, lexeme.is_digit, lexeme.is_title, lexeme.lang_)

Jane 13439688181152664240 Xxxx J ane True False True en
bought 5204146470106475914 xxxx b ght True False False en
me 18197037023634208128 xx m me True False False en
these 6459564349623679250 xxxx t ese True False False en
books 17837313582142403287 xxxx b oks True False False en
. 12646065887601541794 . . . False False False en
Jane 13439688181152664240 Xxxx J ane True False True en
bought 5204146470106475914 xxxx b ght True False False en
a 11901859001352538922 x a a True False False en
book 13814433107111459297 xxxx b ook True False False en
for 16037325823156266367 xxx f for True False False en
me 18197037023634208128 xx m me True False False en
. 12646065887601541794 . . . False False False en
She 5252949303365547547 Xxx S She True False True en
dropped 11269071702302113671 xxxx d ped True False False en
a 11901859001352538922 x a a True False False en
line 9545763306533606446 xxxx l ine True False False en
to 3791531372978436496 xx t to True False False en
him 1739263527992748485

In [18]:
import spacy

In [19]:
nlp = spacy.load("en_core_web_sm")  # load the small English pipeline



In [20]:
text = ('''"Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps."''')

In [21]:
doc = nlp(text)

In [22]:
doc

"Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps."

In [23]:
for token in doc:
  print(token.text)

"
Jane
bought
me
these
books
.
Jane
bought
a
book
for
me
.
She
dropped
a
line
to
him
.
Thank
you
.
She
sleeps
.
I
sleep
a
lot
.
I
was
born
in
Madrid.the
cat
was
chased
by
the
dog
.
I
was
born
in
Madrid
during
1995.Out
of
all
this
,
something
good
will
come
.
Susan
left
after
the
rehearsal
.
She
did
it
well
.
She
sleeps
during
the
morning
,
but
she
sleeps
.
"


In [85]:
# Construction 1
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
nlp = English()
# Create a blank Tokenizer with just the English vocab
tokenizer = Tokenizer(nlp.vocab)

# Construction 2
from spacy.lang.en import English
nlp = English()
# Create a Tokenizer with the default settings for English
# including punctuation rules and exceptions
tokenizer = nlp.tokenizer

In [90]:
tokens = tokenizer('''"Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps."''')
print(len(tokens))
print(tokens)

82
"Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps."


In [93]:
texts = ["Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps."]
for doc in tokenizer.pipe(texts, batch_size=50):
    pass

In [94]:
from spacy.attrs import ORTH, NORM
case = [{ORTH: "do"}, {ORTH: "n't", NORM: "not"}]
tokenizer.add_special_case("don't", case)

In [95]:

tok_exp = nlp.tokenizer.explain("(don't)")
assert [t[0] for t in tok_exp] == ["PREFIX", "SPECIAL-1", "SPECIAL-2", "SUFFIX"]
assert [t[1] for t in tok_exp] == ["(", "do", "n't", ")"]

https://pypi.org/project/beautifultable/

In [24]:
!pip install beautifultable # Install the 'beautifultable' package
import spacy
from beautifultable import BeautifulTable # import

Collecting beautifultable
  Downloading beautifultable-1.1.0-py2.py3-none-any.whl.metadata (13 kB)
Downloading beautifultable-1.1.0-py2.py3-none-any.whl (28 kB)
Installing collected packages: beautifultable
Successfully installed beautifultable-1.1.0


In [25]:
import spacy
from  beautifultable  import BeautifulTable

In [26]:
table = BeautifulTable()
table.columns.header = ["text token", "POS" ]
for token in doc:
  table.rows.append([token.text, token.pos_])
print(table)

+------------+-------+
| text token |  POS  |
+------------+-------+
|     "      | PUNCT |
+------------+-------+
|    Jane    | PROPN |
+------------+-------+
|   bought   | VERB  |
+------------+-------+
|     me     | PRON  |
+------------+-------+
|   these    |  DET  |
+------------+-------+
|   books    | NOUN  |
+------------+-------+
|     .      | PUNCT |
+------------+-------+
|    Jane    | PROPN |
+------------+-------+
|   bought   | VERB  |
+------------+-------+
|     a      |  DET  |
+------------+-------+
|    book    | NOUN  |
+------------+-------+
|    for     |  ADP  |
+------------+-------+
|     me     | PRON  |
+------------+-------+
|     .      | PUNCT |
+------------+-------+
|    She     | PRON  |
+------------+-------+
|  dropped   | VERB  |
+------------+-------+
|     a      |  DET  |
+------------+-------+
|    line    | NOUN  |
+------------+-------+
|     to     |  ADP  |
+------------+-------+
|    him     | PRON  |
+------------+-------+
|     .    

https://universaldependencies.org/u/pos/

## Linguistic features

In [68]:
spacy_tok = spacy.load('en_core_web_sm')
car_lin = spacy_tok(doc)
car_lin

"Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps."

The parsed Doc prints exactly like the original text, but a lot has happened under the hood. We can visualize how the text was parsed using explacy.

In [69]:
!wget https://raw.githubusercontent.com/tylerneylon/explacy/master/explacy.py

--2024-12-03 15:06:38--  https://raw.githubusercontent.com/tylerneylon/explacy/master/explacy.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6896 (6.7K) [text/plain]
Saving to: ‘explacy.py’


2024-12-03 15:06:39 (105 MB/s) - ‘explacy.py’ saved [6896/6896]



A syntactic dependency is a relation between two words in a sentence, in which one word (the head, or governor) governs the other (the dependent).
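
The head and dependent roles can be made concrete with plain Python. The sketch below does not call spaCy: the (token, label, head) triples are transcribed by hand from the explacy tree printed in this notebook for "Jane bought me these books.", and the `arcs` and `children` names are chosen here for illustration.

```python
# (token, dependency label, head) triples, copied by hand from the parse output.
# In spaCy the ROOT token is its own head.
arcs = [
    ("Jane", "nsubj", "bought"),
    ("bought", "ROOT", "bought"),
    ("me", "dative", "bought"),
    ("these", "det", "books"),
    ("books", "dobj", "bought"),
    (".", "punct", "bought"),
]

# Group the dependents under the word that governs them.
children = {}
for token, dep, head in arcs:
    if dep != "ROOT":
        children.setdefault(head, []).append((dep, token))

for head, dependents in children.items():
    for dep, token in dependents:
        print(f"{head} --{dep}--> {token}")
```

With a loaded pipeline, the same triples are available for each token as token.text, token.dep_ and token.head.text.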

In [27]:
import spacy

In [28]:
nlp = spacy.load("en_core_web_sm")  # load the small English pipeline

In [29]:
text = ('''"Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps."''')

In [30]:
doc = nlp(text)
doc

"Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps."

In [99]:
import spacy
import explacy

# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")
explacy.print_parse_info(nlp, 'Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps.')



Dep tree           Token      Dep type  Lemma      Part of Sp
────────────────── ────────── ───────── ────────── ──────────
               ┌─► Jane       nsubj     Jane       PROPN     
           ┌┬──┼── bought     ROOT      buy        VERB      
           ││  └─► me         dative    I          PRON      
           ││  ┌─► these      det       these      DET       
           │└─►└── books      dobj      book       NOUN      
           └─────► .          punct     .          PUNCT     
               ┌─► Jane       nsubj     Jane       PROPN     
          ┌┬┬──┴── bought     ROOT      buy        VERB      
          │││  ┌─► a          det       a          DET       
          ││└─►└── book       dobj      book       NOUN      
          │└──►┌── for        dative    for        ADP       
          │    └─► me         pobj      I          PRON      
          └──────► .          punct     .          PUNCT     
               ┌─► She        nsubj     she        PRON      
        

Here is an example of the most basic whitespace tokenizer. It takes the shared vocab, so it can construct Doc objects. When called on a text, it returns a Doc object consisting of the text split on single space characters. We can then overwrite the nlp.tokenizer attribute with an instance of our custom tokenizer.

In [101]:
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        spaces = [True] * len(words)
        # Avoid zero-length tokens
        for i, word in enumerate(words):
            if word == "":
                words[i] = " "
                spaces[i] = False
        # Remove the final trailing space
        if words[-1] == " ":
            words = words[0:-1]
            spaces = spaces[0:-1]
        else:
           spaces[-1] = False

        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.blank("en")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps.")
print([token.text for token in doc])

['Jane', 'bought', 'me', 'these', 'books.Jane', 'bought', 'a', 'book', 'for', 'me.She', 'dropped', 'a', 'line', 'to', 'him.', 'Thank', 'you.She', 'sleeps.I', 'sleep', 'a', 'lot.I', 'was', 'born', 'in', 'Madrid.the', 'cat', 'was', 'chased', 'by', 'the', 'dog.I', 'was', 'born', 'in', 'Madrid', 'during', '1995.Out', 'of', 'all', 'this', ',', 'something', 'good', 'will', 'come.Susan', 'left', 'after', 'the', 'rehearsal.', 'She', 'did', 'it', 'well.She', 'sleeps', 'during', 'the', 'morning,', 'but', 'she', 'sleeps.']
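
The words and spaces lists built by the tokenizer above must together reproduce the original text. The sketch below repeats the same splitting logic in plain Python, without spaCy, and rebuilds the text from the two lists; the helper names `split_on_single_spaces` and `rebuild` are hypothetical.

```python
def split_on_single_spaces(text):
    """Return (words, spaces) using the same rules as the class above."""
    words = text.split(" ")
    spaces = [True] * len(words)
    # An empty string means two spaces in a row: keep it as a " " token.
    for i, word in enumerate(words):
        if word == "":
            words[i] = " "
            spaces[i] = False
    # A trailing space is carried by the previous token, not by its own token.
    if words[-1] == " ":
        words = words[:-1]
        spaces = spaces[:-1]
    else:
        spaces[-1] = False
    return words, spaces


def rebuild(words, spaces):
    """Join every word with a space when its flag says one follows it."""
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))


text = "She sleeps.I sleep  a lot."
words, spaces = split_on_single_spaces(text)
print(words)
print(rebuild(words, spaces) == text)  # True
```

Because the split only looks at spaces, "sleeps.I" stays a single token, which matches tokens such as 'books.Jane' in the output above.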


In [103]:
!pip install transformers



Example 2: Third-party tokenizers (BERT word pieces)
You can use the same approach to plug in any other third-party tokenizer. Your custom callable just needs to return a Doc object with the tokens produced by your tokenizer. In this example, the wrapper uses a BERT word piece tokenizer, here BertTokenizerFast from the transformers library. The tokens available in the Doc object returned by spaCy now exactly match the word pieces produced by the tokenizer, including the special [CLS] and [SEP] tokens.

In [104]:
from spacy.tokens import Doc
import spacy
from transformers import BertTokenizerFast  # fast WordPiece tokenizer from Hugging Face transformers

class BertTokenizer:
    def __init__(self, vocab, vocab_file, lowercase=True):
        self.vocab = vocab
        # Load the tokenizer by model name or local path; `lowercase` is kept
        # in the signature but is not forwarded to the tokenizer here
        self._tokenizer = BertTokenizerFast.from_pretrained(vocab_file)

    def __call__(self, text):
        tokens = self._tokenizer(text, return_offsets_mapping=True) # Use __call__ method
        words = tokens.tokens() # Get tokens
        spaces = [True] * len(words)
        # Adjust spaces based on offsets
        offsets = tokens["offset_mapping"]
        for i in range(len(offsets) - 1):
            spaces[i] = offsets[i + 1][0] > offsets[i][1]
        # Handle last token
        spaces[-1] = False
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.blank("en")
# Load the pretrained 'bert-base-uncased' tokenizer by name rather than from a local vocabulary file
nlp.tokenizer = BertTokenizer(nlp.vocab, "bert-base-uncased")
doc = nlp("Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps.")
print(doc.text, [token.text for token in doc])

[CLS]jane bought me these books.jane bought a book for me.she dropped a line to him. thank you.she sleeps.i sleep a lot.i was born in madrid.the cat was chased by the dog.i was born in madrid during 1995.out of all this , something good will come.susan left after the rehearsal. she did it well.she sleeps during the morning, but she sleeps.[SEP] ['[CLS]', 'jane', 'bought', 'me', 'these', 'books', '.', 'jane', 'bought', 'a', 'book', 'for', 'me', '.', 'she', 'dropped', 'a', 'line', 'to', 'him', '.', 'thank', 'you', '.', 'she', 'sleeps', '.', 'i', 'sleep', 'a', 'lot', '.', 'i', 'was', 'born', 'in', 'madrid', '.', 'the', 'cat', 'was', 'chased', 'by', 'the', 'dog', '.', 'i', 'was', 'born', 'in', 'madrid', 'during', '1995', '.', 'out', 'of', 'all', 'this', ',', 'something', 'good', 'will', 'come', '.', 'susan', 'left', 'after', 'the', 'rehearsal', '.', 'she', 'did', 'it', 'well', '.', 'she', 'sleeps', 'during', 'the', 'morning', ',', 'but', 'she', 'sleeps', '.', '[SEP]']


Training with custom tokenization (v3.0)
spaCy's training config describes the settings, hyperparameters, pipeline and tokenizer used for constructing and training the pipeline. The [nlp.tokenizer] block refers to a registered function that takes the nlp object and returns a tokenizer. Here, we are registering a function called whitespace_tokenizer in the @tokenizers registry. To make sure spaCy knows how to construct your tokenizer during training, you can pass in your Python file by setting --code functions.py when you run spacy train.

In [105]:
@spacy.registry.tokenizers("whitespace_tokenizer")
def create_whitespace_tokenizer():
    def create_tokenizer(nlp):
        return WhitespaceTokenizer(nlp.vocab)

    return create_tokenizer
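
How a name in the config leads to a tokenizer can be illustrated with a toy registry. This sketch does not use spaCy: `TOKENIZERS` and `register` are hypothetical stand-ins for the @tokenizers registry, the tokenizer simply splits on spaces, and configparser is used only as a convenient parser for the [nlp.tokenizer] block (spaCy has its own config system).

```python
import configparser

# Toy stand-in for the @tokenizers registry: maps a registered name to a factory.
TOKENIZERS = {}


def register(name):
    """Decorator that stores a factory function under the given name."""
    def decorator(factory):
        TOKENIZERS[name] = factory
        return factory
    return decorator


@register("whitespace_tokenizer")
def create_whitespace_tokenizer():
    def create_tokenizer(nlp):
        # Hypothetical tokenizer for this sketch: split on single spaces.
        return lambda text: text.split(" ")
    return create_tokenizer


# The [nlp.tokenizer] block names the registered function to use.
CONFIG_TEXT = """
[nlp.tokenizer]
@tokenizers = "whitespace_tokenizer"
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)
registered_name = parser["nlp.tokenizer"]["@tokenizers"].strip('"')

# Look up the factory by name, then build the tokenizer for a placeholder nlp object.
tokenizer = TOKENIZERS[registered_name]()(None)
print(registered_name, tokenizer("I love coffee"))
```

The two-step call mirrors the registered function above: the outer function receives settings from the config, and the inner one receives the nlp object.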

Registered functions can also take arguments, which are then passed in from the config. This lets you quickly change and keep track of different settings. Here, the registered function called bert_word_piece_tokenizer takes two arguments: the path to a vocabulary file and whether to lowercase the text. The Python type hints str and bool ensure that the received values have the correct type.

To avoid hard-coding local paths into your config file, you can also set the vocab path on the CLI by using the --nlp.tokenizer.vocab_file override when you run spacy train. For more details on using registered functions, see the docs on training with custom code.

In [119]:
@spacy.registry.tokenizers("bert_word_piece_tokenizer")
def create_bert_tokenizer(vocab_file: str, lowercase: bool):
    def create_tokenizer(nlp):
        return BertTokenizer(nlp.vocab, vocab_file, lowercase)
    return create_tokenizer

In [122]:
from spacy.tokens import Doc
import spacy
from transformers import BertTokenizerFast

class BertTokenizer:
    def __init__(self, vocab, vocab_file, lowercase=True):
        self.vocab = vocab
        # vocab_file may be a local path or a Hugging Face model name such as "bert-base-uncased"
        self._tokenizer = BertTokenizerFast.from_pretrained(vocab_file, do_lower_case=lowercase)

    def __call__(self, text):
        tokens = self._tokenizer(text, return_offsets_mapping=True) # Use __call__ method
        words = tokens.tokens() # Get tokens
        spaces = [True] * len(words)
        # Adjust spaces based on offsets
        offsets = tokens["offset_mapping"]
        for i in range(len(offsets) - 1):
            spaces[i] = offsets[i + 1][0] > offsets[i][1]
        # Handle last token
        spaces[-1] = False
        return Doc(self.vocab, words=words, spaces=spaces)

@spacy.registry.tokenizers("bert_word_piece_tokenizer")
def create_bert_tokenizer(vocab_file: str, lowercase: bool):
    def create_tokenizer(nlp):
        return BertTokenizer(nlp.vocab, vocab_file, lowercase)
    return create_tokenizer

# Get the tokenizer creation function from the registry
create_tokenizer = spacy.registry.tokenizers.get("bert_word_piece_tokenizer")

# Call the function to create the tokenizer instance
tokenizer = create_tokenizer("bert-base-uncased", True)(spacy.blank("en")) # Provide nlp object

# Now you can use the tokenizer
doc = tokenizer("Jane bought me these books.")
print([token.text for token in doc])

['[CLS]', 'jane', 'bought', 'me', 'these', 'books', '.', '[SEP]']
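The rule used in `__call__` to decide whether a token is followed by whitespace can be checked on its own. The sketch below is a minimal, standalone version of that rule; the function name `spaces_from_offsets` and the hand-written offsets are assumptions for illustration, not output of the real tokenizer.

```python
from typing import List, Tuple

def spaces_from_offsets(offsets: List[Tuple[int, int]]) -> List[bool]:
    """Return, for each token, whether whitespace follows it in the source text."""
    spaces = []
    for (_, end), (next_start, _) in zip(offsets, offsets[1:]):
        # A gap between this token's end and the next token's start means whitespace.
        spaces.append(next_start > end)
    if offsets:
        spaces.append(False)  # nothing follows the last token
    return spaces

# Hand-written character offsets for "Jane bought books." (illustrative only)
offsets = [(0, 4), (5, 11), (12, 17), (17, 18)]
spaces = spaces_from_offsets(offsets)
print(spaces)
```

The final period starts exactly where "books" ends, so "books" gets no trailing space.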


# **Part-of-speech tagging**
After tokenization, we can parse the text and tag each token with its part of speech. spaCy uses statistical models behind the scenes to predict which tag applies to each word based on its context.

Lemmatization is the process of extracting the base (uninflected) form of a word, known as its lemma. For example:

Adjectives: best, better → good
Adverbs: worse, worst → badly
Nouns: ducks, children → duck, child
Verbs: standing, stood → stand
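The idea of mapping inflected forms to a lemma can be sketched with a plain lookup table. This is a deliberately simple illustration, not how spaCy's lemmatizer works internally; `LEMMA_TABLE` and `lemmatize` are hypothetical names.

```python
# Hypothetical lookup table built from a few irregular English forms.
LEMMA_TABLE = {
    "best": "good", "better": "good",
    "worse": "badly", "worst": "badly",
    "ducks": "duck", "children": "child",
    "standing": "stand", "stood": "stand",
}

def lemmatize(word: str) -> str:
    """Return the table entry for a word, or the lowercased word if it is unknown."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)

lemmas = [lemmatize(w) for w in ["Children", "stood", "better", "cat"]]
print(lemmas)
```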

In [73]:
import pandas as pd

In [127]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps.")

# explacy.print_parse_info expects only the nlp object and the sentence
explacy.print_parse_info(nlp, doc.text)

# If you want to print token-level information, you need to iterate and format the output yourself:
for token in doc:
    print(f"Token: {token.text}, Lemma: {token.lemma_}, POS: {token.pos_}, Tag: {token.tag_}, Dep: {token.dep_}, Shape: {token.shape_}, is_alpha: {token.is_alpha}, is_stop: {token.is_stop}")




Dep tree           Token      Dep type  Lemma      Part of Sp
────────────────── ────────── ───────── ────────── ──────────
               ┌─► Jane       nsubj     Jane       PROPN     
           ┌┬──┼── bought     ROOT      buy        VERB      
           ││  └─► me         dative    I          PRON      
           ││  ┌─► these      det       these      DET       
           │└─►└── books      dobj      book       NOUN      
           └─────► .          punct     .          PUNCT     
               ┌─► Jane       nsubj     Jane       PROPN     
          ┌┬┬──┴── bought     ROOT      buy        VERB      
          │││  ┌─► a          det       a          DET       
          ││└─►└── book       dobj      book       NOUN      
          │└──►┌── for        dative    for        ADP       
          │    └─► me         pobj      I          PRON      
          └──────► .          punct     .          PUNCT     
               ┌─► She        nsubj     she        PRON      
        

In [74]:
tokenized_text = pd.DataFrame()

for i, token in enumerate(car_lin):
    tokenized_text.loc[i, 'text'] = token.text
    tokenized_text.loc[i, 'lemma'] = token.lemma_  # no trailing comma, which would store a tuple
    tokenized_text.loc[i, 'pos'] = token.pos_
    tokenized_text.loc[i, 'tag'] = token.tag_
    tokenized_text.loc[i, 'dep'] = token.dep_
    tokenized_text.loc[i, 'shape'] = token.shape_
    tokenized_text.loc[i, 'is_alpha'] = token.is_alpha
    tokenized_text.loc[i, 'is_stop'] = token.is_stop
    tokenized_text.loc[i, 'is_punctuation'] = token.is_punct

tokenized_text[:20]

Unnamed: 0,text,lemma,pos,tag,dep,shape,is_alpha,is_stop,is_punctuation
0,"""","""",PUNCT,``,punct,"""",False,False,True
1,Jane,Jane,PROPN,NNP,nsubj,Xxxx,True,False,False
2,bought,buy,VERB,VBD,ROOT,xxxx,True,False,False
3,me,I,PRON,PRP,dative,xx,True,True,False
4,these,these,DET,DT,det,xxxx,True,True,False
5,books,book,NOUN,NNS,dobj,xxxx,True,False,False
6,.,.,PUNCT,.,punct,.,False,False,True
7,Jane,Jane,PROPN,NNP,nsubj,Xxxx,True,False,False
8,bought,buy,VERB,VBD,ROOT,xxxx,True,False,False
9,a,a,DET,DT,det,x,True,True,False


Named entity recognition (NER)
Named entities are real-world objects such as people, organizations and places.

spaCy automatically identifies the following entities:

In [75]:
spacy.displacy.render(car_lin, style='ent', jupyter=True)

In [76]:
spacy.explain('GPE') # explain an entity label

'Countries, cities, states'

# Dependency parsing
Dependency parsing assigns a syntactic structure to each sentence by linking every word to its head with a labeled relation; for example, the subject and the object both attach to the verb. spaCy provides a parse tree that can be used to generate this structure.

Sentence boundary detection
Determining where a sentence begins and ends is a very important part of natural language processing.

In [77]:
sentence_spans = list(car_lin.sents)
sentence_spans

["Jane bought me these books.,
 Jane bought a book for me.,
 She dropped a line to him.,
 Thank you.,
 She sleeps.,
 I sleep a lot.,
 I was born in Madrid.the cat was chased by the dog.,
 I was born in Madrid during 1995.Out of all this , something good will come.,
 Susan left after the rehearsal.,
 She did it well.,
 She sleeps during the morning, but she sleeps."]
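In the output above, two pairs of sentences are merged because the period is glued to the next word ("Madrid.the", "1995.Out"). One possible workaround, sketched here under the assumption that the input contains no abbreviations, URLs or decimal-like strings that the rule would break, is to insert the missing space before parsing. The pattern and the helper name `add_space_after_periods` are illustrative choices.

```python
import re

# A period that directly follows a lowercase letter or digit and directly precedes a letter.
MISSING_SPACE = re.compile(r"(?<=[a-z0-9])\.(?=[A-Za-z])")

def add_space_after_periods(text: str) -> str:
    """Insert a space after periods that are glued to the following word."""
    return MISSING_SPACE.sub(". ", text)

sample = "I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this"
fixed = add_space_after_periods(sample)
print(fixed)
```

Numbers such as 3.14 are left alone because the character after the period is a digit, but a string like "e.g" would be split, so the rule should be checked against the actual data first.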

In [128]:
displacy.serve([car_lin], style="dep")




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [78]:
displacy.render(car_lin, style='dep', jupyter=True,options={'distance': 140})

In [129]:
displacy.render(car_lin, style="dep")

In [130]:
deps_parse = displacy.parse_deps(doc)
html = displacy.render(deps_parse, style="dep", manual=True)

In [132]:
ents_parse = displacy.parse_ents(doc)
html = displacy.render(ents_parse, style="ent", manual=True)

In [133]:
ents_parse = displacy.parse_spans(doc, options={"spans_key" : "orgs"})
html = displacy.render(ents_parse, style="span", manual=True)


Available keys: []


In [134]:
options = {"compact": True, "color": "blue"}
displacy.serve(doc, style="dep", options=options)




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [135]:
options = {"ents": ["PERSON", "ORG", "PRODUCT"],
           "colors": {"ORG": "yellow"}}
displacy.serve(doc, style="ent", options=options)


Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [136]:
options = {"spans_key": "sc"}
displacy.serve(doc, style="span", options=options)


Available keys: []



Using the 'span' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


Scroll down if you cannot see the output above. You can also customize the output of the dependency visualizer, as shown below.

In [79]:
options = {'compact': True, 'bg': 'violet','distance': 140,
           'color': 'white', 'font': 'Trebuchet MS'}
displacy.render(car_lin, jupyter=True, style='dep', options=options)

In [80]:
spacy.explain("ADJ") ,spacy.explain("det") ,spacy.explain("ADP") ,spacy.explain("prep")  # to understand tags

('adjective', 'determiner', 'adposition', 'prepositional modifier')

# **Processing noun chunks**

In [81]:
noun_chunks_df = pd.DataFrame()

for i, chunk in enumerate(car_lin.noun_chunks):
    noun_chunks_df.loc[i, 'text'] = chunk.text
    # No trailing commas here: they would wrap each value in a one-element tuple
    noun_chunks_df.loc[i, 'root'] = chunk.root
    noun_chunks_df.loc[i, 'root.text'] = chunk.root.text
    noun_chunks_df.loc[i, 'root.dep_'] = chunk.root.dep_
    noun_chunks_df.loc[i, 'root.head.text'] = chunk.root.head.text

noun_chunks_df[:20]

Unnamed: 0,text,root,root.text,root.dep_,root.head.text
0,Jane,Jane,Jane,nsubj,bought
1,me,me,me,dative,bought
2,these books,books,books,dobj,bought
3,Jane,Jane,Jane,nsubj,bought
4,a book,book,book,dobj,bought
5,me,me,me,pobj,for
6,She,She,She,nsubj,dropped
7,a line,line,line,dobj,dropped
8,him,him,him,pobj,to
9,you,you,you,dobj,Thank


In [142]:
chunks = list(doc.noun_chunks)
# Instead of asserting a specific length, print the actual length and the first chunk
print(f"Number of noun chunks: {len(chunks)}")
if chunks:  # Check if there are any chunks before accessing the first one
    print(f"First noun chunk: {chunks[0].text}")

# You can examine the noun chunks and adjust your expectations accordingly
for i, chunk in enumerate(chunks):
    print(f"Noun chunk {i + 1}: {chunk.text}")

Number of noun chunks: 26
First noun chunk: Jane
Noun chunk 1: Jane
Noun chunk 2: me
Noun chunk 3: these books
Noun chunk 4: Jane
Noun chunk 5: a book
Noun chunk 6: me
Noun chunk 7: She
Noun chunk 8: a line
Noun chunk 9: him
Noun chunk 10: you
Noun chunk 11: She
Noun chunk 12: I
Noun chunk 13: I
Noun chunk 14: Madrid.the cat
Noun chunk 15: the dog
Noun chunk 16: I
Noun chunk 17: Madrid
Noun chunk 18: all this
Noun chunk 19: something
Noun chunk 20: Susan
Noun chunk 21: the rehearsal
Noun chunk 22: She
Noun chunk 23: it
Noun chunk 24: She
Noun chunk 25: the morning
Noun chunk 26: she


## Part of Speech (POS)

In [145]:
doc = nlp("Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps.")

print([(ent.text, ent.label_) for ent in doc.ents])

[('Jane', 'PERSON'), ('Jane', 'PERSON'), ('Madrid', 'GPE'), ('1995.Out', 'CARDINAL'), ('Susan', 'PERSON'), ('the morning', 'TIME')]


In [148]:
import spacy
from spacy.language import Language
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

@Language.component("extract_person_orgs")
def extract_person_orgs(doc):
    person_entities = [ent for ent in doc.ents if ent.label_ == "PERSON"]
    for ent in person_entities:
        head = ent.root.head
        if head.lemma_ == "work":
            preps = [token for token in head.children if token.dep_ == "prep"]
            for prep in preps:
                orgs = [token for token in prep.children if token.ent_type_ == "ORG"]
                print({'person': ent, 'orgs': orgs, 'past': head.tag_ == "VBD"})
    return doc

# To make the entities easier to work with, we'll merge them into single tokens
nlp.add_pipe("merge_entities")
nlp.add_pipe("extract_person_orgs")
doc = nlp("Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps.")
# If you're not in a Jupyter / IPython environment, use displacy.serve
displacy.render(doc, options={"fine_grained": True})



In [149]:
with doc.retokenize() as retokenizer:
  for ent in doc.ents:
      retokenizer.merge(ent)

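## token.text = the token's text
## token.pos_ = coarse-grained part-of-speech tag
## token.tag_ = fine-grained part-of-speech tag
## token.dep_ = syntactic dependency relation
## token.shape_ = word shape (capitalization, digits, punctuation)
## token.is_alpha = whether the token text consists of alphabetic characters
## token.is_stop = whether the token is a stop word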
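To make the shape attribute more concrete, here is a rough approximation of how a word shape such as `Xxxx` can be computed. It is a sketch written for this note, not spaCy's own function; the name `word_shape` and the `max_run` parameter are assumptions.

```python
def word_shape(text: str, max_run: int = 4) -> str:
    """Approximate a token shape: letters become x/X, digits become d, long runs are cut."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            mapped = "X" if char.isupper() else "x"
        elif char.isdigit():
            mapped = "d"
        else:
            mapped = char  # punctuation is kept as-is
        run = run + 1 if mapped == last else 1
        last = mapped
        if run <= max_run:  # keep at most max_run repeats of the same symbol
            shape.append(mapped)
    return "".join(shape)

shapes = {w: word_shape(w) for w in ["Jane", "bought", "me", "1995", "."]}
print(shapes)
```

The results for "Jane", "bought" and "me" agree with the shape column shown in the token table earlier in this notebook.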

In [31]:
tok_l = doc.to_json()['tokens']
for t in tok_l:
  head = tok_l[t['head']]
  # 'start' and 'end' are character offsets, so slice doc.text rather than doc
  print(f"'{doc.text[t['start']:t['end']]}' is {t['dep']} of '{doc.text[head['start']:head['end']]}'")

'"' is punct of 'bought'
'Jane' is nsubj of 'bought'
'bought' is ROOT of 'bought'
'me' is dative of 'bought'
'these' is det of 'books'
'books' is dobj of 'bought'
'.' is punct of 'bought'
'Jane' is nsubj of 'bought'
'bought' is ROOT of 'bought'
'a' is det of 'book'
'book' is dobj of 'bought'
'for' is dative of 'bought'
'me' is pobj of 'for'
'.' is punct of 'bought'
'She' is nsubj of 'dropped'
'dropped' is ROOT of 'dropped'
'a' is det of 'line'
'line' is dobj of 'dropped'
'to' is prep 
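In the dictionary returned by `doc.to_json()`, each token's `start` and `end` are character offsets into the text, while `head` is a token index. The sketch below uses a hand-written dictionary with the same keys (the values are illustrative, not real parser output) to show how the two kinds of index are used.

```python
# Hand-written stand-in for doc.to_json(); the values are illustrative only.
doc_json = {
    "text": "Jane bought books.",
    "tokens": [
        {"id": 0, "start": 0, "end": 4, "dep": "nsubj", "head": 1},
        {"id": 1, "start": 5, "end": 11, "dep": "ROOT", "head": 1},
        {"id": 2, "start": 12, "end": 17, "dep": "dobj", "head": 1},
        {"id": 3, "start": 17, "end": 18, "dep": "punct", "head": 1},
    ],
}

def token_text(data: dict, token: dict) -> str:
    """Slice the raw text with the token's character offsets."""
    return data["text"][token["start"]:token["end"]]

relations = []
for tok in doc_json["tokens"]:
    head = doc_json["tokens"][tok["head"]]  # "head" is a token index, not an offset
    relations.append((token_text(doc_json, tok), tok["dep"], token_text(doc_json, head)))

for child, dep, head_text in relations:
    print(f"'{child}' is {dep} of '{head_text}'")
```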

## Visualizing dependency parsing

In [151]:
doc = nlp("Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps.")

# document level
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)

# token level
ent_jane = [doc[0].text, doc[0].ent_iob_, doc[0].ent_type_]
ent_bought = [doc[1].text, doc[1].ent_iob_, doc[1].ent_type_]
print(ent_jane)  # ['Jane', 'B', 'PERSON']
print(ent_bought)  # ['bought', 'O', '']

[('Jane', 0, 4, 'PERSON'), ('Jane', 27, 31, 'PERSON'), ('Madrid', 180, 186, 'GPE'), ('1995.Out', 194, 202, 'CARDINAL'), ('Susan', 242, 247, 'PERSON'), ('the morning', 308, 319, 'TIME')]
['Jane', 'B', 'PERSON']
['bought', 'O', '']
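The token-level values follow the IOB scheme: B marks the first token of an entity, I a token inside one, and O a token outside any entity. As a minimal sketch of how such tags map back to entity spans, the helper below groups hand-written tags; `iob_to_entities` and the sample lists are assumptions for illustration, not spaCy API.

```python
from typing import List, Tuple

def iob_to_entities(tokens: List[str], iob: List[str], types: List[str]) -> List[Tuple[str, str]]:
    """Group tokens tagged B/I into (entity text, label) pairs; O tokens are skipped."""
    entities = []
    current, label = [], ""
    for text, tag, ent_type in zip(tokens, iob, types):
        if tag == "B" or (tag == "I" and not current):
            # A new entity starts; close the previous one if it is still open.
            if current:
                entities.append((" ".join(current), label))
            current, label = [text], ent_type
        elif tag == "I":
            current.append(text)
        else:  # "O" closes any open entity
            if current:
                entities.append((" ".join(current), label))
            current, label = [], ""
    if current:
        entities.append((" ".join(current), label))
    return entities

tokens = ["She", "sleeps", "during", "the", "morning", ","]
iob = ["O", "O", "O", "B", "I", "O"]
types = ["", "", "", "TIME", "TIME", ""]
entities = iob_to_entities(tokens, iob, types)
print(entities)
```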


In [154]:
import spacy
from spacy.tokens import Span

ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('Before', ents)


Before [('Jane', 0, 4, 'ORG'), ('Jane', 27, 31, 'PERSON'), ('Madrid', 180, 186, 'GPE'), ('1995.Out', 194, 202, 'CARDINAL'), ('Susan', 242, 247, 'PERSON'), ('the morning', 308, 319, 'TIME')]


In [155]:
import numpy
import spacy
from spacy.attrs import ENT_IOB, ENT_TYPE

nlp = spacy.load("en_core_web_sm")
doc = nlp("Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps.")
print("Before", doc.ents)  # entities predicted by the model

header = [ENT_IOB, ENT_TYPE]
attr_array = numpy.zeros((len(doc), len(header)), dtype="uint64")
attr_array[0, 0] = 3  # B
attr_array[0, 1] = doc.vocab.strings["GPE"]
doc.from_array(header, attr_array)
print("After", doc.ents)  # (Jane,), now labeled GPE



Before (Jane, Jane, Madrid, 1995.Out, Susan, the morning)
After (Jane,)


In [164]:
import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

# Default tokenizer
nlp = spacy.load("en_core_web_sm")
doc = nlp("mother-in-law")
print([t.text for t in doc]) # ['mother', '-', 'in', '-', 'law']

# Modify tokenizer infix patterns
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        # Single backslashes: in a raw string, "\\." would match a backslash, not a period
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        # Commented out regex that splits on hyphens between letters:
        # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)

infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer
doc = nlp("Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps.")
print("Before", doc.ents)
print([t.text for t in doc])  # hyphens are no longer split; this text contains none

['mother', '-', 'in', '-', 'law']
Before (Jane, Jane, Madrid, 1995.Out, Susan, the morning)
['Jane', 'bought', 'me', 'these', 'books', '.', 'Jane', 'bought', 'a', 'book', 'for', 'me', '.', 'She', 'dropped', 'a', 'line', 'to', 'him', '.', 'Thank', 'you', '.', 'She', 'sleeps', '.', 'I', 'sleep', 'a', 'lot', '.', 'I', 'was', 'born', 'in', 'Madrid.the', 'cat', 'was', 'chased', 'by', 'the', 'dog', '.', 'I', 'was', 'born', 'in', 'Madrid', 'during', '1995.Out', 'of', 'all', 'this', ',', 'something', 'good', 'will', 'come', '.', 'Susan', 'left', 'after', 'the', 'rehearsal', '.', 'She', 'did', 'it', 'well', '.', 'She', 'sleeps', 'during', 'the', 'morning', ',', 'but', 'she', 'sleeps', '.']
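Escaping matters a great deal in infix patterns. In a Python raw string, `\.` is an escaped period, whereas `\\.` is a literal backslash followed by any character, so a rule written with doubled backslashes never fires on ordinary text. The sketch below shows the difference with the standard `re` module; the character classes are simplified ASCII stand-ins, not spaCy's `ALPHA_LOWER` and `ALPHA_UPPER`.

```python
import re

# Simplified stand-ins for spaCy's character classes (assumption: ASCII letters only).
LOWER, UPPER = "a-z", "A-Z"

# In a raw string, \. is an escaped period ...
period_infix = re.compile(rf"(?<=[{LOWER}])\.(?=[{UPPER}])")
# ... while \\. is a literal backslash followed by any character.
backslash_infix = re.compile(rf"(?<=[{LOWER}])\\.(?=[{UPPER}])")

word = "books.Jane"
period_match = period_infix.search(word)
backslash_match = backslash_infix.search(word)
print(period_match is not None, backslash_match is not None)
```

Only the first pattern finds the period between "books" and "Jane", which is what allows a tokenizer to split the two words.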


In [165]:
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        spaces = [True] * len(words)
        # Avoid zero-length tokens
        for i, word in enumerate(words):
            if word == "":
                words[i] = " "
                spaces[i] = False
        # Remove the final trailing space
        if words[-1] == " ":
            words = words[0:-1]
            spaces = spaces[0:-1]
        else:
            spaces[-1] = False

        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.blank("en")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps.")
print("Before", doc.ents)  # []
print([token.text for token in doc])


Before ()
['Jane', 'bought', 'me', 'these', 'books.Jane', 'bought', 'a', 'book', 'for', 'me.She', 'dropped', 'a', 'line', 'to', 'him.', 'Thank', 'you.She', 'sleeps.I', 'sleep', 'a', 'lot.I', 'was', 'born', 'in', 'Madrid.the', 'cat', 'was', 'chased', 'by', 'the', 'dog.I', 'was', 'born', 'in', 'Madrid', 'during', '1995.Out', 'of', 'all', 'this', ',', 'something', 'good', 'will', 'come.Susan', 'left', 'after', 'the', 'rehearsal.', 'She', 'did', 'it', 'well.She', 'sleeps', 'during', 'the', 'morning,', 'but', 'she', 'sleeps.']
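A useful property of the `words` and `spaces` lists is that they should reproduce the original text exactly. The following sketch repeats the splitting logic as a plain function, without spaCy, and checks that round trip; the names `whitespace_split` and `rebuild` are introduced here for illustration.

```python
from typing import List, Tuple

def whitespace_split(text: str) -> Tuple[List[str], List[bool]]:
    """Split on single spaces and record whether a space follows each token."""
    words = text.split(" ")
    spaces = [True] * len(words)
    for i, word in enumerate(words):
        if word == "":  # produced by consecutive or trailing spaces
            words[i] = " "
            spaces[i] = False
    if words[-1] == " ":
        # The final placeholder is already covered by the previous token's trailing space
        words, spaces = words[:-1], spaces[:-1]
    else:
        spaces[-1] = False
    return words, spaces

def rebuild(words: List[str], spaces: List[bool]) -> str:
    """Concatenate tokens and their trailing spaces back into the original text."""
    return "".join(w + (" " if s else "") for w, s in zip(words, spaces))

for sample in ["a b", "a  b ", "him. Thank you.She"]:
    words, spaces = whitespace_split(sample)
    assert rebuild(words, spaces) == sample
    print(words, spaces)
```

Because the split is purely on spaces, punctuation stays attached to words ("him.", "you.She"), which matches the token list printed above.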


In [32]:
table = BeautifulTable()
table.columns.header = ["text token", "POS" , "TAG", "Dep", "Shape", "is_alpha", "is_stop"]
for token in doc:
  table.rows.append([token.text, token.pos_, token.tag_,token.dep_, token.shape_,token.is_alpha, token.is_stop ])
print(table)

+------------+-------+-----+-----------+-----------+----------+---------+
| text token |  POS  | TAG |    Dep    |   Shape   | is_alpha | is_stop |
+------------+-------+-----+-----------+-----------+----------+---------+
|     "      | PUNCT | ``  |   punct   |     "     |    0     |    0    |
+------------+-------+-----+-----------+-----------+----------+---------+
|    Jane    | PROPN | NNP |   nsubj   |   Xxxx    |    1     |    0    |
+------------+-------+-----+-----------+-----------+----------+---------+
|   bought   | VERB  | VBD |   ROOT    |   xxxx    |    1     |    0    |
+------------+-------+-----+-----------+-----------+----------+---------+
|     me     | PRON  | PRP |  dative   |    xx     |    1     |    1    |
+------------+-------+-----+-----------+-----------+----------+---------+
|   these    |  DET  | DT  |    det    |   xxxx    |    1     |    1    |
+------------+-------+-----+-----------+-----------+----------+---------+
|   books    | NOUN  | NNS |   dobj   

## Dependency parsing is the process of analyzing a sentence to extract its grammatical structure. It defines the dependency relations between head words and their dependents

In [33]:
from spacy import displacy

In [34]:
options = {"compact":True, "distance": 100, "color": "white", "bg": "#09a3d5", "font": "Comic"}
displacy.render(doc, style="dep", jupyter=True, options= options)

In [35]:
displacy.render(doc, style="dep", jupyter=True, options= {"compact": True})

## Sentence Boundary Detection

This means detecting where each sentence begins and ends in a given text

In [36]:
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download en_core_web_sm
!pip install beautifultable


In [37]:
import spacy

In [38]:
nlp = spacy.load("en_core_web_sm")

In [39]:
text = ('''"Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps."''')

In [40]:
doc = nlp (text)
doc

"Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps."

## Even when the input text is a whole paragraph, it is tokenized first; then the parts of speech are tagged and the dependencies are parsed

In [41]:
doc.text

'"Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps."'

In [42]:
list(doc)

[",
 Jane,
 bought,
 me,
 these,
 books,
 .,
 Jane,
 bought,
 a,
 book,
 for,
 me,
 .,
 She,
 dropped,
 a,
 line,
 to,
 him,
 .,
 Thank,
 you,
 .,
 She,
 sleeps,
 .,
 I,
 sleep,
 a,
 lot,
 .,
 I,
 was,
 born,
 in,
 Madrid.the,
 cat,
 was,
 chased,
 by,
 the,
 dog,
 .,
 I,
 was,
 born,
 in,
 Madrid,
 during,
 1995.Out,
 of,
 all,
 this,
 ,,
 something,
 good,
 will,
 come,
 .,
 Susan,
 left,
 after,
 the,
 rehearsal,
 .,
 She,
 did,
 it,
 well,
 .,
 She,
 sleeps,
 during,
 the,
 morning,
 ,,
 but,
 she,
 sleeps,
 .,
 "]

In [43]:
list(doc.sents)

["Jane bought me these books.,
 Jane bought a book for me.,
 She dropped a line to him.,
 Thank you.,
 She sleeps.,
 I sleep a lot.,
 I was born in Madrid.the cat was chased by the dog.,
 I was born in Madrid during 1995.Out of all this , something good will come.,
 Susan left after the rehearsal.,
 She did it well.,
 She sleeps during the morning, but she sleeps."]

In [44]:
sents = list(doc.sents)

In [45]:
sents

["Jane bought me these books.,
 Jane bought a book for me.,
 She dropped a line to him.,
 Thank you.,
 She sleeps.,
 I sleep a lot.,
 I was born in Madrid.the cat was chased by the dog.,
 I was born in Madrid during 1995.Out of all this , something good will come.,
 Susan left after the rehearsal.,
 She did it well.,
 She sleeps during the morning, but she sleeps."]

In [46]:
len(sents)

11

In [47]:
for sent in sents:
  print(sent)### sentences

"Jane bought me these books.
Jane bought a book for me.
She dropped a line to him.
Thank you.
She sleeps.
I sleep a lot.
I was born in Madrid.the cat was chased by the dog.
I was born in Madrid during 1995.Out of all this , something good will come.
Susan left after the rehearsal.
She did it well.
She sleeps during the morning, but she sleeps."


### Stop words

## Stop words are the most common words in a language; in English, for example: the, are, but, and, they

## I am eating → I, eating
## I was eating → I, eating
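Stop-word filtering is a set-membership test, and the result depends on whether case is ignored. The sketch below uses a small hypothetical stop list (spaCy's English list is much longer and lowercase) to show both variants; `remove_stop_words` is a name chosen for this example.

```python
# Small hypothetical stop list; entries are lowercase, as in spaCy's English list.
STOP_WORDS = {"i", "am", "was", "the", "are", "but", "and", "they"}

def remove_stop_words(tokens, case_sensitive=False):
    """Drop tokens found in STOP_WORDS, optionally ignoring case."""
    kept = []
    for token in tokens:
        key = token if case_sensitive else token.lower()
        if key not in STOP_WORDS:
            kept.append(token)
    return kept

tokens = ["I", "am", "eating"]
insensitive = remove_stop_words(tokens)
sensitive = remove_stop_words(tokens, case_sensitive=True)
print(insensitive, sensitive)
```

With a case-sensitive test the capitalized "I" survives, which is why a check such as `token.text not in stopwords` keeps it; `token.is_stop` is the case-insensitive alternative in spaCy.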

In [48]:
stopwords = spacy.lang.en.stop_words.STOP_WORDS

In [49]:
len(stopwords)

326

In [50]:
stopwords

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'fron

In [51]:
###stopwords = list(nlp.Defaults.stop_words)

In [52]:
len(sents)

11

In [53]:
list(doc.sents)

["Jane bought me these books.,
 Jane bought a book for me.,
 She dropped a line to him.,
 Thank you.,
 She sleeps.,
 I sleep a lot.,
 I was born in Madrid.the cat was chased by the dog.,
 I was born in Madrid during 1995.Out of all this , something good will come.,
 Susan left after the rehearsal.,
 She did it well.,
 She sleeps during the morning, but she sleeps."]

In [54]:
for sent in sents:
  print(sent.text)
  # The inner loop is nested so that it runs for every sentence, not only the last one
  for token in sent:
    if token.text not in stopwords:  # case-sensitive membership test
      print(token)
      break

"Jane bought me these books.
"
Jane bought a book for me.
Jane
She dropped a line to him.
She
Thank you.
Thank
She sleeps.
She
I sleep a lot.
I
I was born in Madrid.the cat was chased by the dog.
I
I was born in Madrid during 1995.Out of all this , something good will come.
I
Susan left after the rehearsal.
Susan
She did it well.
She
She sleeps during the morning, but she sleeps."
She


In [55]:
for sent in sents:
  print(sent.text)
  # The inner loop is nested so that it runs for every sentence, not only the last one
  for token in sent:
    if token.text in stopwords:  # first token of the sentence that is a stop word
      print(token)
      break

"Jane bought me these books.
me
Jane bought a book for me.
a
She dropped a line to him.
a
Thank you.
you
She sleeps.
I sleep a lot.
a
I was born in Madrid.the cat was chased by the dog.
was
I was born in Madrid during 1995.Out of all this , something good will come.
was
Susan left after the rehearsal.
after
She did it well.
did
She sleeps during the morning, but she sleeps."
during


In [56]:
len(token.text)

6

### Lemmatization

## Lemmatization is the process of reducing the inflected forms of a word to its root form, called the lemma

## playing → play
## that is, reducing words to their root form

In [57]:
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download en_core_web_sm
!pip install beautifultable


In [58]:
import spacy

In [59]:
nlp = spacy.load("en_core_web_sm")

In [60]:
text = ('''"Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps."''')

In [61]:
doc = nlp(text)

In [62]:
doc

"Jane bought me these books.Jane bought a book for me.She dropped a line to him. Thank you.She sleeps.I sleep a lot.I was born in Madrid.the cat was chased by the dog.I was born in Madrid during 1995.Out of all this , something good will come.Susan left after the rehearsal. She did it well.She sleeps during the morning, but she sleeps."

In [63]:
!pip install beautifultable



In [64]:
from beautifultable import BeautifulTable

In [65]:
for token in doc:
  print(token.text, token.lemma_)

" "
Jane Jane
bought buy
me I
these these
books book
. .
Jane Jane
bought buy
a a
book book
for for
me I
. .
She she
dropped drop
a a
line line
to to
him he
. .
Thank thank
you you
. .
She she
sleeps sleep
. .
I I
sleep sleep
a a
lot lot
. .
I I
was be
born bear
in in
Madrid.the Madrid.the
cat cat
was be
chased chase
by by
the the
dog dog
. .
I I
was be
born bear
in in
Madrid Madrid
during during
1995.Out 1995.out
of of
all all
this this
, ,
something something
good good
will will
come come
. .
Susan Susan
left leave
after after
the the
rehearsal rehearsal
. .
She she
did do
it it
well well
. .
She she
sleeps sleep
during during
the the
morning morning
, ,
but but
she she
sleeps sleep
. .
" "


In [66]:
table = BeautifulTable()
table.columns.header = ["text token", "token.lemma_"]
for token in doc:
  table.rows.append([token.text, token.lemma_])
print(table)

+------------+--------------+
| text token | token.lemma_ |
+------------+--------------+
|     "      |      "       |
+------------+--------------+
|    Jane    |     Jane     |
+------------+--------------+
|   bought   |     buy      |
+------------+--------------+
|     me     |      I       |
+------------+--------------+
|   these    |    these     |
+------------+--------------+
|   books    |     book     |
+------------+--------------+
|     .      |      .       |
+------------+--------------+
|    Jane    |     Jane     |
+------------+--------------+
|   bought   |     buy      |
+------------+--------------+
|     a      |      a       |
+------------+--------------+
|    book    |     book     |
+------------+--------------+
|    for     |     for      |
+------------+--------------+
|     me     |      I       |
+------------+--------------+
|     .      |      .       |
+------------+--------------+
|    She     |     she      |
+------------+--------------+
|  dropped

In [67]:
# Print only the tokens whose lemma differs from their surface form
for token in doc:
    if token.text != token.lemma_:
        print(token.text, token.lemma_)