# NPL con Spacy

*(Trabajo en paralelo con dos archivos: uno en inglés y otro en español)*

0. Librerías previamente instaladas

<li>Instalación del framework:

    !pip install spacy
<li>Modelo/librería para inglés:

    !python -m spacy download en_core_web_sm
<li>Modelo/librería para español:    

    !python -m spacy download es_core_news_sm

1. Importar/Cargar framework + librería en inglés/español

In [3]:
import spacy

In [4]:
nlp_en = spacy.load("en_core_web_sm")
nlp_es = spacy.load("es_core_news_sm")

2. Carga de archivos

In [5]:
with open ("data/wiki_us.txt", "r") as f:
    text_en = f.read()

In [6]:
with open ("data/La_bici_de_Eloísa.txt", "r") as f:
    text_es = f.read()

(Vista "general")

In [15]:
print (f"=================\nTEXTO EN INGLÉS:\n=================\n\n{text_en}\n\n=================\nTEXTO EN ESPAÑOL:\n=================\n\n{text_es}")

TEXTO EN INGLÉS:

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen Br

In [16]:
doc_en = nlp_en(text_en)
doc_es = nlp_es(text_es)

Acá los podrías imprimir pero la vista es igual.

La longitud del doc es menor porque los contenedores almancenan palabras.

Incluso los paréntesis son contados como palabras.

El contendedor (token) de "text" es letra por letra:

In [25]:
print ("Longitud de los contenedores doc vs. el texto:\n\n")
print (f"Long. de text_en:{len(text_en)}\nLong. de doc_en:{len(doc_en)}\n")
print (f"Long. de text_es:{len(text_es)}\nLong. de doc_es:{len(doc_es)}\n")

Longitud de los contenedores doc vs. el texto:


Long. de text_en:3525
Long. de doc_en:652

Long. de text_es:3937
Long. de doc_es:889



Cuando imprimimos el token del texto directamente, cada token representa un caracter/espacio/signo.

In [31]:
for token in text_en[:7]:
    print (token)

T
h
e
 
U
n
i


Cuando imprimimos el token del contenedor doc, cada token representa una palabra.

Los paréntesis, por ejemplos, son reconocidos como tokens dentro del contenedor.

In [32]:
for token in doc_en[:10]:
    print (token)

The
United
States
of
America
(
U.S.A.
or
USA
)


A continuación, los paréntesis no serán removidos ni tenidos encuenta individualmente, desde el texto.

In [33]:
for token in text_en.split()[:10]:
    print (token)

The
United
States
of
America
(U.S.A.
or
USA),
commonly
known


In [8]:
words = text.split()[:10]

In [9]:
for token in words[:10]:
    print (token)

The
United
States
of
America
(U.S.A.
or
USA),
commonly
known


In [10]:
i=5
for token in doc[i:8]:
    print (f"SpaCy Token {i}:\n{token}\nWord Split {i}:\n{words[i]}\n\n")
    i=i+1

SpaCy Token 5:
(
Word Split 5:
(U.S.A.


SpaCy Token 6:
U.S.A.
Word Split 6:
or


SpaCy Token 7:
or
Word Split 7:
USA),




In [11]:
for sent in doc.sents:
    print (sent)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]
At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]
The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world.
The national capital is Washington, D.C., and the most populous city is New York.


Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.
The United States emerged from the thirteen British colonies es

That is because the sents attribute is a generator. In python, we can usually iterate over generators by converting them into a list. So, let’s do that.

In [13]:
sentence1 = list(doc.sents)[0]
print (sentence1)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.


2.4. Token Attributes¶
The token object contains a lot of different attributes that are VITAL do performing NLP in spaCy. We will be working with a few of them, such as:

.text

.head

.left_edge

.right_edge

.ent_type_

.iob_

.lemma_

.morph

.pos_

.dep_

.lang_

In [14]:
token2 = sentence1[2]
print (token2)

States


In [15]:
token2.text

'States'

In [16]:
token2.head

is

This tells to which word it is governed by, in this case, the primary verb, “is”, as it is part of the noun subject.

In [18]:
token2.left_edge

The

In [19]:
token2.right_edge

,

In [20]:
token2.lang_

'en'

In [23]:
from spacy import displacy
displacy.render(sentence1, style="dep")

In [21]:
for ent in doc.ents:
    print (ent.text, ent.label_)


The United States of America GPE
U.S.A. GPE
USA GPE
the United States GPE
U.S. GPE
US GPE
America GPE
North America LOC
50 CARDINAL
five CARDINAL
326 CARDINAL
Indian NORP
3.8 million square miles QUANTITY
9.8 million square kilometers QUANTITY
third- or fourth CARDINAL
The United States GPE
Canada GPE
Mexico GPE
Bahamas GPE
Cuba GPE
more than 331 million MONEY
third ORDINAL
Washington GPE
D.C. GPE
New York GPE
Paleo-Indians NORP
Siberia LOC
North American NORP
at least 12,000 years ago DATE
European NORP
the 16th century DATE
The United States GPE
thirteen CARDINAL
British NORP
the East Coast LOC
Great Britain GPE
the American Revolutionary War ORG
1775â€“1783 CARDINAL
the late 18th century DATE
U.S. GPE
North America LOC
Native Americans NORP
1848 DATE
the United States GPE
United States GPE
the second half of the 19th century DATE
the American Civil War ORG
The Spanishâ€“American War and World War I EVENT
U.S. GPE
World War II EVENT
the Cold War EVENT
the United States GPE
the Korean

In [24]:
displacy.render(doc, style="ent")