# Information extraction from text

### Some options for text extraction
![Options for text extraction](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/09/hearst_patterns-768x435.png)

### Import the required packages

In [1]:
#!pip install spacy

In [2]:
import re 
import string 
import nltk 
import spacy 
import pandas as pd 
import numpy as np 
import math 
from tqdm import tqdm 

from spacy.matcher import Matcher 
from spacy.tokens import Span 
from spacy import displacy 

pd.set_option('display.max_colwidth', 200)

## Load the pretrained model

This model tags each token in the text with its part of speech and its syntactic dependency label.

In [3]:
import en_core_web_sm
nlp = en_core_web_sm.load()

## Create a sample text to extract information from

In [4]:
# sample text 
text = "GDP in developing countries such as Vietnam will continue growing at a high rate." 

# create a spaCy Doc object by running the loaded nlp pipeline over the text
doc = nlp(text)

### Print the dependency label and POS tag of each token

In [5]:
# print token, dependency, POS tag 
for tok in doc:
  print(tok.text, "-->",tok.dep_,"-->", tok.pos_)

GDP --> nsubj --> PROPN
in --> prep --> ADP
developing --> amod --> VERB
countries --> pobj --> NOUN
such --> amod --> ADJ
as --> prep --> SCONJ
Vietnam --> pobj --> PROPN
will --> aux --> VERB
continue --> ROOT --> VERB
growing --> xcomp --> VERB
at --> prep --> ADP
a --> det --> DET
high --> amod --> ADJ
rate --> pobj --> NOUN
. --> punct --> PUNCT
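
The abbreviated labels in this output can be looked up with `spacy.explain`, which returns a short description for a tag or dependency label. The snippet below is a small sketch; the list of labels is just a selection taken from the output above.

```python
import spacy

# A few of the dependency labels and POS tags printed above
labels = ["nsubj", "pobj", "amod", "xcomp", "PROPN", "ADP"]

for label in labels:
    # spacy.explain returns a human-readable description of the label
    print(f"{label:>6} --> {spacy.explain(label)}")
```

This only reads spaCy's built-in glossary, so it does not need the `en_core_web_sm` model to be loaded.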


### Define the text pattern to extract

In [6]:
#define the pattern 
pattern = [{'POS':'NOUN'}, 
           {'LOWER': 'such'}, 
           {'LOWER': 'as'}, 
           {'POS': 'PROPN'}] #proper noun
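
For comparison, the same "X such as Y" shape can be roughly approximated with a plain regular expression. This is not what the notebook does: the sketch below ignores POS tags entirely, accepts any word before "such as", and assumes that a capitalized word afterwards is a proper noun. The name `SUCH_AS` is made up for this illustration.

```python
import re

text = "GDP in developing countries such as Vietnam will continue growing at a high rate."

# word, then the literal "such as", then a capitalized word
# (assumption: capitalization stands in for the PROPN tag)
SUCH_AS = re.compile(r"\b(\w+) such as ([A-Z]\w*)")

pairs = SUCH_AS.findall(text)
for category, example in pairs:
    print(category, "such as", example)
```

The token-based pattern is more robust because it checks the part of speech of each token rather than its spelling.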

### Search for the pattern and extract the matching text

In [7]:
# Matcher class object 
matcher = Matcher(nlp.vocab) 
# list-of-patterns form, required by spaCy v3 (the older add(name, None, pattern) call fails there)
matcher.add("matching_1", [pattern])

matches = matcher(doc) 
span = doc[matches[0][1]:matches[0][2]] 

print(span.text)

countries such as Vietnam
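
Indexing `matches[0]` assumes there is exactly one match of interest and raises an `IndexError` when nothing matches. A slightly more defensive sketch loops over every match instead. To keep it runnable without a downloaded model, it uses `spacy.blank("en")` (tokenizer only, no POS tags) and swaps the POS constraints for the lexical attributes `IS_ALPHA` and `IS_TITLE`; the extended sample sentence and the pattern name `such_as` are assumptions made for this illustration.

```python
import spacy
from spacy.matcher import Matcher

# Tokenizer-only pipeline: no POS tags, so the pattern uses lexical attributes
nlp = spacy.blank("en")

# Hypothetical sentence with two "X such as Y" phrases
text = ("GDP in developing countries such as Vietnam and cities "
        "such as Hanoi will keep growing.")
doc = nlp(text)

# any alphabetic token, then "such as", then a title-cased token
pattern = [{"IS_ALPHA": True},
           {"LOWER": "such"},
           {"LOWER": "as"},
           {"IS_TITLE": True}]

matcher = Matcher(nlp.vocab)
matcher.add("such_as", [pattern])

# Each match is a (match_id, start, end) tuple; sort by start position
found = [doc[start:end].text
         for _, start, end in sorted(matcher(doc), key=lambda m: m[1])]

if not found:
    print("no match")
for phrase in found:
    print(phrase)
```

With the full `en_core_web_sm` pipeline, the original `POS`-based pattern can be used in the same loop.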


## Another example with dates and separate patterns

### Write the new text and show the dependency label and POS tag of each token

In [8]:
doc = nlp("Cristiano is my friend since high school and he was born in 1985.") 
# print dependency tags and POS tags
for tok in doc: 
  print(tok.text, "-->",tok.dep_, "-->",tok.pos_)

Cristiano --> nsubj --> PROPN
is --> ROOT --> AUX
my --> poss --> DET
friend --> attr --> NOUN
since --> prep --> SCONJ
high --> amod --> ADJ
school --> pobj --> NOUN
and --> cc --> CCONJ
he --> nsubjpass --> PRON
was --> auxpass --> AUX
born --> ROOT --> VERB
in --> prep --> ADP
1985 --> pobj --> NUM
. --> punct --> PUNCT


### Define two patterns, search for each, and combine the results

In [9]:
# Matcher class object 
matcher = Matcher(nlp.vocab) 

# define the patterns

# a single proper noun
pattern1 = [{'POS':'PROPN'}]

# an optional auxiliary, then 'born in', then a number (the year)
pattern2 = [{'POS':'AUX', 'OP':"?"},
           {'LOWER': 'born'},
           {'LOWER': 'in'},
           {'POS': 'NUM'}]

# first matcher: the proper noun
# list-of-patterns form, required by spaCy v3 (the older add(name, None, pattern) call fails there)
matcher.add("matching_1", [pattern1])

matches = matcher(doc)
span = doc[matches[0][1]:matches[0][2]]

# second matcher: the 'born in <year>' phrase
matcher = Matcher(nlp.vocab)
matcher.add("matching_1", [pattern2])
matches = matcher(doc)
span2 = doc[matches[0][1]:matches[0][2]]

## Print what was found

In [10]:
print(span.text, span2.text)

Cristiano was born in 1985
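
As an alternative to creating two `Matcher` objects, both patterns can be registered on one matcher under different names and told apart through the match id. The sketch below is one possible arrangement, not the notebook's method. It runs on `spacy.blank("en")`, so the POS constraints are replaced by lexical stand-ins (`IS_TITLE` for the proper noun, the literal word "was" for the auxiliary, `IS_DIGIT` for the year); the names `NAME`, `BIRTH`, and `longest` are made up for this illustration.

```python
import spacy
from spacy.matcher import Matcher

# Tokenizer-only pipeline: no POS tags available
nlp = spacy.blank("en")
doc = nlp("Cristiano is my friend since high school and he was born in 1985.")

matcher = Matcher(nlp.vocab)
# title-cased token as a stand-in for a proper noun
matcher.add("NAME", [[{"IS_TITLE": True}]])
# optional "was", then "born in", then a run of digits
matcher.add("BIRTH", [[{"LOWER": "was", "OP": "?"},
                       {"LOWER": "born"},
                       {"LOWER": "in"},
                       {"IS_DIGIT": True}]])

# Keep the longest span per pattern name; the optional token
# yields both "born in 1985" and "was born in 1985"
longest = {}
for match_id, start, end in matcher(doc):
    label = nlp.vocab.strings[match_id]  # recover the pattern name
    span = doc[start:end]
    if label not in longest or len(span) > len(longest[label]):
        longest[label] = span

if "NAME" in longest and "BIRTH" in longest:
    print(longest["NAME"].text, longest["BIRTH"].text)
```

Looking results up by name avoids relying on the position of a match in the returned list.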


Based on *https://www.analyticsvidhya.com/blog/2019/09/introduction-information-extraction-python-spacy/*