# Playing with grammar

Topics covered:

* Counting nouns (plural and singular).
* Getting the dependency parse.
* Splitting sentences into clauses.
* Extracting noun chunks.
* Extracting entities and relations.
* Extracting subjects and objects of the sentence.
* Finding references.

## Counting nouns - plural and singular nouns

In [1]:
# Import the required modules
import nltk
from nltk.stem import WordNetLemmatizer
import inflect

In [3]:
def tokenize_nltk(text):
    return nltk.tokenize.word_tokenize(text)

def pos_tag_nltk(text):
    words = tokenize_nltk(text)
    words_with_pos = nltk.pos_tag(words)
    return words_with_pos


In [4]:
# Read the English text
filename = "pv.txt"
# The with block closes the file as soon as it has been read
with open(filename, "r", encoding="utf-8") as file:
    text = file.read()
text = text.replace("\n", " ")

In [5]:
# Run part-of-speech tagging
words_with_pos = pos_tag_nltk(text)
print(words_with_pos)

[('A', 'DT'), ('photovoltaic', 'JJ'), ('array', 'NN'), ('(', '('), ('PVA', 'NNP'), (')', ')'), ('simulation', 'NN'), ('model', 'NN'), ('to', 'TO'), ('be', 'VB'), ('used', 'VBN'), ('in', 'IN'), ('Matlab-Simulink', 'NNP'), ('GUI', 'NNP'), ('environment', 'NN'), ('is', 'VBZ'), ('developed', 'VBN'), ('and', 'CC'), ('presented', 'VBN'), ('in', 'IN'), ('this', 'DT'), ('paper', 'NN'), ('.', '.'), ('The', 'DT'), ('model', 'NN'), ('is', 'VBZ'), ('developed', 'VBN'), ('using', 'VBG'), ('basic', 'JJ'), ('circuit', 'NN'), ('equations', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('photovoltaic', 'NN'), ('(', '('), ('PV', 'NNP'), (')', ')'), ('solar', 'VBP'), ('cells', 'NNS'), ('including', 'VBG'), ('the', 'DT'), ('effects', 'NNS'), ('of', 'IN'), ('solar', 'JJ'), ('irradiation', 'NN'), ('and', 'CC'), ('temperature', 'NN'), ('changes', 'NNS'), ('.', '.'), ('The', 'DT'), ('new', 'JJ'), ('model', 'NN'), ('was', 'VBD'), ('tested', 'VBN'), ('using', 'VBG'), ('a', 'DT'), ('directly', 'RB'), ('coupled', 'VBN'), 

In [6]:
# Define the get_nouns function to filter the nouns:
def get_nouns(words_with_pos):
    noun_set = ["NN", "NNS"]
    nouns = [word for word in words_with_pos if  word[1] in noun_set]
    return nouns

In [7]:
# Run the function above on the list of POS-tagged words:
nouns = get_nouns(words_with_pos)
print(nouns)

[('array', 'NN'), ('simulation', 'NN'), ('model', 'NN'), ('environment', 'NN'), ('paper', 'NN'), ('model', 'NN'), ('circuit', 'NN'), ('equations', 'NNS'), ('photovoltaic', 'NN'), ('cells', 'NNS'), ('effects', 'NNS'), ('irradiation', 'NN'), ('temperature', 'NN'), ('changes', 'NNS'), ('model', 'NN'), ('dc', 'NN'), ('load', 'NN'), ('load', 'NN'), ('inverter', 'NN'), ('validation', 'NN'), ('studies', 'NNS'), ('load', 'NN'), ('circuits', 'NNS'), ('results', 'NNS')]


To determine whether a noun is singular or plural, we have two options. The first is to use the part-of-speech tags assigned by NLTK, where NN marks a singular noun and NNS marks a plural noun.

The following function uses the NLTK tags and returns True if the input noun is plural:

In [8]:
def is_plural_nltk(noun_info):
    pos = noun_info[1]
    if (pos == "NNS"):
        return True
    else:
        return False

In [9]:
is_plural_nltk(('equations', 'NNS'))

True

In [10]:
is_plural_nltk(('photovoltaic', 'NN'))

False
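
With the tags in hand, the counting that this section's title refers to is a short tally. The sketch below is only an illustration and not part of the original notebook: `sample_nouns` holds a few pairs copied from the `get_nouns` output above, and `count_noun_numbers` is a hypothetical helper name.

```python
from collections import Counter

# A few (word, tag) pairs copied from the get_nouns output above.
sample_nouns = [
    ("array", "NN"),
    ("equations", "NNS"),
    ("cells", "NNS"),
    ("model", "NN"),
    ("effects", "NNS"),
]

def count_noun_numbers(tagged_nouns):
    """Tally singular (NN) and plural (NNS) nouns in a list of (word, tag) pairs."""
    counts = Counter(tag for _, tag in tagged_nouns)
    return {"singular": counts["NN"], "plural": counts["NNS"]}

noun_counts = count_noun_numbers(sample_nouns)
print(noun_counts)
```

The same helper could be called on the full `nouns` list produced earlier, since it only relies on the second element of each pair.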

The other option is to use the WordNetLemmatizer class from the nltk.stem package. A noun is treated as plural when its lemma differs from the word itself. The following function returns True if the noun is plural:

In [11]:
def is_plural_wn(noun):
    wnl = WordNetLemmatizer()
    lemma = wnl.lemmatize(noun, 'n')
    # Compare by value: the noun is plural when its lemma differs from it
    return noun != lemma

In [12]:
is_plural_wn('equations')

True

In [13]:
is_plural_wn('photovoltaic')

False
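
When a noun is compared with its lemma, value equality (`==` / `!=`) is the safe test. Identity (`is` / `is not`) asks whether two names refer to the same string object, which depends on how each string was built. A minimal sketch with plain strings, no NLTK involved; the variable names are illustrative only:

```python
# Two strings with the same characters, built in different ways.
literal = "equation"
built = "".join(["equa", "tion"])

same_value = literal == built    # compares the characters
same_object = literal is built   # compares object identity

print(same_value)    # equal strings always compare equal by value
print(same_object)   # depends on the interpreter, so never rely on it
```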

In [14]:
# The following function turns a singular noun into its plural:
def get_plural(singular_noun):
    p = inflect.engine()
    return p.plural(singular_noun)

In [15]:
get_plural('photovoltaic')

'photovoltaics'

In [16]:
# The following function turns a plural noun into its singular:
def get_singular(plural_noun):
    p = inflect.engine()
    # singular_noun returns False when the word is not recognized as a plural
    singular = p.singular_noun(plural_noun)
    if (singular):
        return singular
    else:
        return plural_noun

In [17]:
get_singular('equations')

'equation'

We can now use the two functions above to return a list of nouns converted to plural or singular, depending on the original noun. The following code uses the is_plural_wn function to determine whether the noun is plural. You could also use the is_plural_nltk function:

In [18]:
def plurals_wn(words_with_pos):
    other_nouns = []
    for noun_info in words_with_pos:
        word = noun_info[0]
        plural = is_plural_wn(word)
        if (plural):
            singular = get_singular(word)
            other_nouns.append(singular)
        else:
            plural = get_plural(word)
            other_nouns.append(plural)
    
    return other_nouns

In [19]:
# Use the function above to return a list of converted nouns:
other_nouns_wn = plurals_wn(nouns)
print(other_nouns_wn)

['arrays', 'simulations', 'models', 'environments', 'papers', 'models', 'circuits', 'equation', 'photovoltaics', 'cell', 'effect', 'irradiations', 'temperatures', 'change', 'models', 'dcs', 'loads', 'loads', 'inverters', 'validations', 'study', 'loads', 'circuit', 'result']


## Getting the dependency parse

In [21]:
# Import the library
import spacy


# Use the GPU if one is available; otherwise spaCy keeps running on the CPU
spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")

In [22]:
# Define the sentence to analyze
sentence = 'I have seldom heard him mention her under any other name.'

In [23]:
# Load the spaCy engine
nlp = spacy.load('en_core_web_sm')

In [24]:
# Process the sentence with the spaCy engine
doc = nlp(sentence)

The dependency information is stored in the doc object. We can see the dependency labels by looping over the tokens in doc:

In [25]:
for token in doc:
    print(token.text, "\t", token.dep_, "\t", spacy.explain(token.dep_))

I 	 nsubj 	 nominal subject
have 	 aux 	 auxiliary
seldom 	 advmod 	 adverbial modifier
heard 	 ROOT 	 None
him 	 nsubj 	 nominal subject
mention 	 ccomp 	 clausal complement
her 	 dobj 	 direct object
under 	 prep 	 prepositional modifier
any 	 det 	 determiner
other 	 amod 	 adjectival modifier
name 	 pobj 	 object of preposition
. 	 punct 	 punctuation


To explore the dependency parse structure, we can use the attributes of the Token class. The **ancestors** attribute yields the tokens this token depends on, directly or transitively (its head, the head's head, and so on up to the root), and the **children** attribute yields the tokens that depend on it directly.

In [26]:
for token in doc:
    print(token.text)
    ancestors = [t.text for t in token.ancestors]
    print(ancestors)

I
['heard']
have
['heard']
seldom
['heard']
heard
[]
him
['mention', 'heard']
mention
['heard']
her
['mention', 'heard']
under
['mention', 'heard']
any
['name', 'under', 'mention', 'heard']
other
['name', 'under', 'mention', 'heard']
name
['under', 'mention', 'heard']
.
['heard']


In [27]:
# Child tokens
for token in doc:
    print(token.text)
    children = [t.text for t in token.children]
    print(children)

I
[]
have
[]
seldom
[]
heard
['I', 'have', 'seldom', 'mention', '.']
him
[]
mention
['him', 'her', 'under']
her
[]
under
['name']
any
[]
other
[]
name
['any', 'other']
.
[]


In [28]:
# We can also see the subtree rooted at each token:
for token in doc:
    print(token.text)
    subtree = [t.text for t in token.subtree]
    print(subtree)

I
['I']
have
['have']
seldom
['seldom']
heard
['I', 'have', 'seldom', 'heard', 'him', 'mention', 'her', 'under', 'any', 'other', 'name', '.']
him
['him']
mention
['him', 'mention', 'her', 'under', 'any', 'other', 'name']
her
['her']
under
['under', 'any', 'other', 'name']
any
['any']
other
['other']
name
['any', 'other', 'name']
.
['.']
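
The three views above (ancestors, children, subtree) all follow from one piece of information: each token's head. The sketch below rebuilds them in plain Python, without spaCy, from a hand-written list of head indices that was read off the outputs above. The names `heads`, `ancestor_indices`, `ancestors`, `children` and `subtree` are illustrative, and the root is given itself as head, as spaCy does.

```python
words = ["I", "have", "seldom", "heard", "him", "mention",
         "her", "under", "any", "other", "name", "."]
# heads[i] is the index of the token that words[i] depends on.
# The root ("heard", index 3) points to itself.
heads = [3, 3, 3, 3, 5, 3, 5, 5, 10, 10, 7, 3]

def ancestor_indices(i):
    """Follow head links upward from token i until the root is reached."""
    chain = []
    while heads[i] != i:
        i = heads[i]
        chain.append(i)
    return chain

def ancestors(i):
    return [words[j] for j in ancestor_indices(i)]

def children(i):
    """Tokens whose direct head is token i."""
    return [words[j] for j, head in enumerate(heads) if head == i and j != i]

def subtree(i):
    """Token i plus every token that has i among its ancestors, in sentence order."""
    return [words[j] for j in range(len(words))
            if j == i or i in ancestor_indices(j)]

print(ancestors(8))   # walks up from "any"
print(children(5))    # direct dependents of "mention"
print(subtree(7))     # everything under "under"
```

Reading it this way makes the asymmetry explicit: `ancestors` climbs all the way to the root, while `children` only looks one level down.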


**Let's try Spanish**

In [29]:
# Download the Spanish model first:
#   python -m spacy download es_core_news_sm
nlp = spacy.load("es_core_news_sm")

In [30]:
sentence = 'Me gusta la programación y el mundo maker.'

In [31]:
doc = nlp(sentence)
doc

Me gusta la programación y el mundo maker.

In [32]:
for token in doc:
    print(token.text, "\t", token.dep_, "\t", spacy.explain(token.dep_))

Me 	 obj 	 object
gusta 	 ROOT 	 None
la 	 det 	 determiner
programación 	 nsubj 	 nominal subject
y 	 cc 	 coordinating conjunction
el 	 det 	 determiner
mundo 	 conj 	 conjunct
maker 	 amod 	 adjectival modifier
. 	 punct 	 punctuation


In [33]:
for token in doc:
    print(token.text)
    ancestors = [t.text for t in token.ancestors]
    print(ancestors)

Me
['gusta']
gusta
[]
la
['programación', 'gusta']
programación
['gusta']
y
['mundo', 'programación', 'gusta']
el
['mundo', 'programación', 'gusta']
mundo
['programación', 'gusta']
maker
['mundo', 'programación', 'gusta']
.
['gusta']


## Splitting sentences into clauses

When working with text, we frequently deal with compound sentences (sentences with two parts that are equally important) and complex sentences (sentences with one part depending on another).

In [34]:
# Import the library
import spacy

In [35]:
# Load the spaCy engine
nlp = spacy.load('en_core_web_sm')

In [36]:
# Define the sentence
sentence = "He eats cheese, but he won't eat ice cream."

In [37]:
# Process the sentence with the engine:
doc = nlp(sentence)

In [38]:
# Inspect the structure of the sentence
for token in doc:
    ancestors = [t.text for t in token.ancestors]
    children = [t.text for t in token.children]
    print(token.text, "\t", token.i, "\t", token.pos_, "\t", token.dep_, "\t",
          ancestors, "\t", children)

He 	 0 	 PRON 	 nsubj 	 ['eats'] 	 []
eats 	 1 	 VERB 	 ROOT 	 [] 	 ['He', 'cheese', ',', 'but', 'eat']
cheese 	 2 	 NOUN 	 dobj 	 ['eats'] 	 []
, 	 3 	 PUNCT 	 punct 	 ['eats'] 	 []
but 	 4 	 CCONJ 	 cc 	 ['eats'] 	 []
he 	 5 	 PRON 	 nsubj 	 ['eat', 'eats'] 	 []
wo 	 6 	 VERB 	 aux 	 ['eat', 'eats'] 	 []
n't 	 7 	 PART 	 neg 	 ['eat', 'eats'] 	 []
eat 	 8 	 VERB 	 conj 	 ['eats'] 	 ['he', 'wo', "n't", 'cream', '.']
ice 	 9 	 NOUN 	 compound 	 ['cream', 'eat', 'eats'] 	 []
cream 	 10 	 NOUN 	 dobj 	 ['eat', 'eats'] 	 ['ice']
. 	 11 	 PUNCT 	 punct 	 ['eat', 'eats'] 	 []


We will use the following function to find the root token of the sentence, which is usually the main verb. When there is a dependent clause, it is the verb of the independent clause:

In [39]:
def find_root_of_sentence(doc):
    root_token = None
    for token in doc:
        if (token.dep_ == "ROOT"):
            root_token = token
    return root_token

In [40]:
# Find the root token of the sentence:
root_token = find_root_of_sentence(doc)
root_token

eats

In [41]:
# Now we can use the following function to find
# the other verbs in the sentence (verbs whose only ancestor is the root):
def find_other_verbs(doc, root_token):
    other_verbs = []
    for token in doc:
        ancestors = list(token.ancestors)
        if (token.pos_ == "VERB" and len(ancestors) == 1\
            and ancestors[0] == root_token):
            other_verbs.append(token)
    return other_verbs

In [42]:
# Use the function above to find the remaining
# verbs in the sentence:
other_verbs = find_other_verbs(doc, root_token)
other_verbs

[eat]

In [43]:
# Find the token span of each verb:
def get_clause_token_span_for_verb(verb, doc, all_verbs):
    first_token_index = len(doc)
    last_token_index = 0
    this_verb_children = list(verb.children)
    for child in this_verb_children:
        if (child not in all_verbs):
            if (child.i < first_token_index):
                first_token_index = child.i
            if (child.i > last_token_index):
                last_token_index = child.i
    return(first_token_index, last_token_index)

We will gather all the verbs in a list and process each one with the preceding function. This returns a tuple of start and end indices for each verb's clause:

In [44]:
token_spans = []
all_verbs = [root_token] + other_verbs
for other_verb in all_verbs:
    (first_token_index, last_token_index) = \
    get_clause_token_span_for_verb(other_verb,doc, all_verbs)
    token_spans.append((first_token_index, last_token_index))

In [45]:
# Build the token ranges for each clause:
sentence_clauses = []
for token_span in token_spans:
    start = token_span[0]
    end = token_span[1]
    if (start < end):
        clause = doc[start:end]
        sentence_clauses.append(clause)
sentence_clauses = sorted(sentence_clauses,
                          key=lambda tup: tup[0])
print(sentence_clauses)

[He eats cheese,, he won't eat ice cream]


In [46]:
# Print the final result:
clauses_text = [clause.text for clause in sentence_clauses]
print(clauses_text)

['He eats cheese,', "he won't eat ice cream"]
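
The span logic can be traced without spaCy. The sketch below uses plain lists and the child indices from the In [38] table; `verb_children` and `clause_span` are illustrative names, not part of the notebook. It also shows why "but" and the final period are missing from the clauses: the last child index is used as a slice end, and slice ends are exclusive.

```python
tokens = ["He", "eats", "cheese", ",", "but", "he",
          "wo", "n't", "eat", "ice", "cream", "."]

# Direct children of each verb, by token index (taken from the In [38] table).
verb_children = {1: [0, 2, 3, 4, 8], 8: [5, 6, 7, 10, 11]}
verbs = set(verb_children)

def clause_span(verb_index):
    """Smallest and largest index among the verb's children that are not verbs."""
    kids = [i for i in verb_children[verb_index] if i not in verbs]
    return min(kids), max(kids)

clauses = []
for verb_index in sorted(verbs):
    start, end = clause_span(verb_index)
    # tokens[start:end] stops just before index `end`
    clauses.append(" ".join(tokens[start:end]))

print(clauses)
```

Here the dropped tokens happen to be a conjunction and a punctuation mark, which is convenient for this sentence; a sentence whose last child is a content word would lose that word too.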


## Extracting noun chunks

Noun chunks are known in linguistics as **noun phrases**.

In [47]:
# Read the text with the following function:
def read_text_file(filename):
    # The with block closes the file once it has been read
    with open(filename, "r", encoding="utf-8") as file:
        return file.read()

In [48]:
# Load the text:
text = read_text_file("pv.txt")

# Replace line breaks with spaces:
text = text.replace("\n", " ")

In [49]:
# Initialize the spaCy engine
nlp = spacy.load('en_core_web_md')
doc = nlp(text)
doc

A photovoltaic array (PVA) simulation model to be used in Matlab-Simulink GUI environment is developed and presented in this paper. The model is developed using basic circuit equations of the photovoltaic (PV) solar cells including the effects of solar irradiation and temperature changes. The new model was tested using a directly coupled dc load as well as ac load via an inverter. Test and validation studies with proper load matching circuits are simulated and results are presented here.

In [50]:
# The noun chunks are available through the
# doc.noun_chunks property.
# We can print the chunks:
for noun_chunk in doc.noun_chunks:
    print(noun_chunk.text)

A photovoltaic array (PVA) simulation model
Matlab-Simulink GUI environment
this paper
The model
basic circuit equations
the photovoltaic (PV) solar cells
the effects
solar irradiation and temperature changes
The new model
a directly coupled dc load
ac load
an inverter
Test and validation studies
proper load matching circuits
results


**There's more**

Noun chunks are spaCy Span objects and have all of their properties. See the official documentation at https://spacy.io/api/token.

In [51]:
# Import spaCy
import spacy
# Load the engine:
nlp = spacy.load('en_core_web_sm')

In [52]:
# Define an example sentence:
sentence = "All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind."

In [53]:
# Process the sentence with the engine:
doc = nlp(sentence)

In [54]:
# Let's look at the noun chunks in this sentence:
for noun_chunk in doc.noun_chunks:
    print(noun_chunk.text)

All emotions
one
his cold, precise but admirably balanced mind


In [55]:
for noun_chunk in doc.noun_chunks:
    print(noun_chunk.text, "\t", noun_chunk.start, "\t", noun_chunk.end)

All emotions 	 0 	 2
one 	 5 	 6
his cold, precise but admirably balanced mind 	 11 	 19
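
The start and end values printed above are token indices, with the end exclusive, so they behave like ordinary Python slice bounds. A small sketch on a hand-tokenized copy of the sentence (the token list is written out by hand here, assuming punctuation marks are separate tokens, which is what the indices above imply):

```python
tokens = ["All", "emotions", ",", "and", "that", "one", "particularly", ",",
          "were", "abhorrent", "to", "his", "cold", ",", "precise", "but",
          "admirably", "balanced", "mind", "."]

# (start, end) pairs copied from the output above; end is exclusive.
spans = [(0, 2), (5, 6), (11, 19)]

chunks = [tokens[start:end] for start, end in spans]
for chunk in chunks:
    print(len(chunk), chunk)
```

The length of each chunk is simply `end - start`, and the token at index `end` (the final period for the last chunk) is not part of it.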


In [56]:
# We can also print the sentence that each noun chunk belongs to:
for noun_chunk in doc.noun_chunks:
    print(noun_chunk.text, "\t", noun_chunk.sent)

All emotions 	 All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.
one 	 All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.
his cold, precise but admirably balanced mind 	 All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.


In [57]:
'''
Like a sentence, every noun chunk has a root,
which is the token that all the other tokens depend on.
In a noun phrase, that is the noun:
'''
for noun_chunk in doc.noun_chunks:
    print(noun_chunk.text, "\t", noun_chunk.root.text)

All emotions 	 emotions
one 	 one
his cold, precise but admirably balanced mind 	 mind


Another very useful **Span** method is **similarity**, which estimates the semantic similarity between two texts. Let's try it out. We will take another noun, **emotions**, and process it with **spacy**:

In [58]:
other_span = "emotions"
other_doc = nlp(other_span)

In [59]:
for noun_chunk in doc.noun_chunks:
    print(noun_chunk.similarity(other_doc))

0.44171491123509243
-0.10580340746622595
0.022172331323610288


  


In [60]:
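# Fix the warning above: the small model has no word vectors, so load the medium model and reprocess both texts
nlp = spacy.load('en_core_web_md')
doc = nlp(sentence)
other_doc = nlp(other_span)
for noun_chunk in doc.noun_chunks:
    print(noun_chunk.similarity(other_doc))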




## Extracting entities and relations

In [84]:
# Import the required libraries
import spacy
import textacy

In [61]:
# Define the helper function we are going to use
def find_root_of_sentence(doc):
    root_token = None
    for token in doc:
        if (token.dep_ == "ROOT"):
            root_token = token
    return root_token

In [62]:
# Load the spaCy engine
nlp = spacy.load('en_core_web_sm')

In [63]:
# Define the sentences we are going to process
sentences = ["All living things are made of cells.",  "Cells have organelles."]

In [64]:
verb_patterns = [[{"POS":"AUX"}, {"POS":"VERB"},  {"POS":"ADP"}],  [{"POS":"AUX"}]]
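
Each pattern is a sequence of per-token constraints: the first matches an auxiliary followed by a verb and an adposition, the second a lone auxiliary. To see what such patterns select, here is a plain-Python sketch that slides each pattern over a list of POS tags. The tags are hand-assigned for the first example sentence (an assumption; a real tagger may differ), and `match_pos_patterns` is an illustrative stand-in, not the matcher that textacy or spaCy actually use.

```python
# Hand-assigned coarse POS tags for "All living things are made of cells."
words = ["All", "living", "things", "are", "made", "of", "cells", "."]
pos_tags = ["DET", "ADJ", "NOUN", "AUX", "VERB", "ADP", "NOUN", "PUNCT"]

# The same two patterns as verb_patterns, reduced to lists of POS tags.
patterns = [["AUX", "VERB", "ADP"], ["AUX"]]

def match_pos_patterns(tags, patterns):
    """Return (start, end) pairs, end exclusive, for every pattern match."""
    matches = []
    for pattern in patterns:
        width = len(pattern)
        for start in range(len(tags) - width + 1):
            if tags[start:start + width] == pattern:
                matches.append((start, start + width))
    return matches

matches = match_pos_patterns(pos_tags, patterns)
# Both patterns fire on "are", so keep the longest match.
longest = max(matches, key=lambda span: span[1] - span[0])

print(matches)
print(" ".join(words[longest[0]:longest[1]]))
```

Because the short pattern is a prefix of the long one, overlapping matches are expected, which is why a step that picks the longest verb phrase is needed.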

The **contains_root** function checks whether a verb phrase contains the root of the sentence:

In [65]:
def contains_root(verb_phrase, root):
    vp_start = verb_phrase.start
    vp_end = verb_phrase.end
    # Span.end is exclusive, so the root must lie in [vp_start, vp_end)
    return vp_start <= root.i < vp_end

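The **get_verb_phrases** function gets the verb phrases from a spaCy Doc object:

In [66]:
def get_verb_phrases(doc):
    root = find_root_of_sentence(doc)
    # Note: textacy.extract.matches is the name in older textacy releases; newer
    # ones appear to expose it as textacy.extract.token_matches (check your version)
    verb_phrases = textacy.extract.matches(doc, verb_patterns)
    new_vps = []
    for verb_phrase in verb_phrases:
        if (contains_root(verb_phrase, root)):
            new_vps.append(verb_phrase)
    # Return only after every candidate phrase has been checked
    return new_vps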

The **longer_verb_phrase** function finds the longest verb phrase:

In [68]:
def longer_verb_phrase(verb_phrases):
    longest_length = 0
    longest_verb_phrase = None
    for verb_phrase in verb_phrases:
        if len(verb_phrase) > longest_length:
            # Remember the best candidate so far and keep scanning
            longest_length = len(verb_phrase)
            longest_verb_phrase = verb_phrase
    return longest_verb_phrase

The **find_noun_phrase** function looks for noun phrases to the left or right of the main verb phrase:

In [69]:
def find_noun_phrase(verb_phrase, noun_phrases, side):
    for noun_phrase in noun_phrases:
        if (side == "left" and noun_phrase.start < verb_phrase.start):
            return noun_phrase
        elif (side == "right" and noun_phrase.start > verb_phrase.start):
            return noun_phrase
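
The left/right lookup only compares start indices, so it can be traced on plain tuples. In this sketch the spans for "All living things are made of cells." are written by hand (assumed values, not parser output), and `find_side` is an illustrative counterpart of find_noun_phrase that returns the phrase text.

```python
# Hand-written (start, end) token spans; end indices are exclusive.
noun_phrases = {"All living things": (0, 3), "cells": (6, 7)}
verb_phrase = (3, 6)  # "are made of"

def find_side(verb_span, noun_spans, side):
    """Return the first noun phrase that starts before or after the verb phrase."""
    for text, (start, _end) in noun_spans.items():
        if side == "left" and start < verb_span[0]:
            return text
        if side == "right" and start > verb_span[0]:
            return text
    # No noun phrase on the requested side
    return None

triplet = (
    find_side(verb_phrase, noun_phrases, "left"),
    "are made of",
    find_side(verb_phrase, noun_phrases, "right"),
)
print(triplet)
```

Note that the first match wins on each side: with several noun phrases to the left, this approach returns the earliest one in the sentence, not the one closest to the verb.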

In this function, we use the previous functions to find relation triplets (left noun phrase, verb phrase, right noun phrase) in the sentences:

In [70]:
def find_triplet(sentence):
    doc = nlp(sentence)
    verb_phrases = get_verb_phrases(doc)
    # Materialize the generator so it can be scanned once per side
    noun_phrases = list(doc.noun_chunks)
    verb_phrase = None
    if (len(verb_phrases) > 1):
        verb_phrase = longer_verb_phrase(list(verb_phrases))
    else:
        verb_phrase = verb_phrases[0]
    # Look up the noun phrases on both sides, whichever branch ran
    left_noun_phrase = find_noun_phrase(verb_phrase, noun_phrases, "left")
    right_noun_phrase = find_noun_phrase(verb_phrase, noun_phrases, "right")
    return (left_noun_phrase, verb_phrase, right_noun_phrase)

In [71]:
# Loop over the sentences to find the relation triplets
for sentence in sentences:
    (left_np, vp, right_np) = find_triplet(sentence)
    print(left_np, "\t", vp, "\t", right_np)

NameError: name 'textacy' is not defined