Imports

In [15]:
import nltk  
import numpy as np  
import random  
import string

import bs4 as bs  
import urllib.request  
import re 
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Getting the data (in this case from Wikipedia, on the topic of NLP)

In [12]:
def extrear_contenido_url(url):
  respuesta = urllib.request.urlopen(url)  # Request the page at the given URL
  content = respuesta.read() # Read the raw response body

  html = bs.BeautifulSoup(content, 'lxml') # Parse the HTML (requires the lxml package)

  parrafos = html.find_all('p') # Find every <p> (paragraph) element

  texto = '' # Start with an empty string

  for parrafo in parrafos:  # Append the text of each paragraph
      texto += parrafo.text
  return texto

In [13]:
texto = extrear_contenido_url('https://en.wikipedia.org/wiki/Natural_language_processing')
print(texto)

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.  The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.
Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.
Natural language processing has its roots in the 1950s. Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence, though at the time that was not articulate

Tokenizing into sentences

In [18]:
frases = nltk.sent_tokenize(texto)  
print(frases)

['Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.', 'The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them.', 'The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.', 'Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.', 'Natural language processing has its roots in the 1950s.', 'Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence, though at the time that wa

Next, we clean the corpus by:
- Converting it to lowercase
- Replacing non-word characters (anything other than letters, digits, and underscores) with spaces
- Collapsing runs of consecutive whitespace into a single space
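
As a quick illustration, the three steps above can be applied to a single string. This is only a sketch: `clean_sentence` and the example string are hypothetical and are not part of the notebook.

```python
import re

def clean_sentence(sentence):
    """Apply the three cleaning steps to one sentence (hypothetical helper)."""
    lowered = sentence.lower()                # 1. lowercase
    no_punct = re.sub(r'\W', ' ', lowered)    # 2. non-word characters -> space
    return re.sub(r'\s+', ' ', no_punct)      # 3. collapse whitespace runs

example = 'Hello, World!  NLP (2020).'
cleaned = clean_sentence(example)
print(repr(cleaned))  # 'hello world nlp 2020 '
```

Note the trailing space: the final period becomes a space and is never removed. Calling `.strip()` on the result would drop it if that matters downstream.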

In [23]:
def limpiar_corpus(corpus):
  aux = list(corpus) # Work on a copy so the caller's list is not modified
  for i in range(len(aux)):
    aux[i] = aux[i].lower() # Convert to lowercase
    aux[i] = re.sub(r'\W',' ',aux[i]) # Replace non-word characters with a space
    aux[i] = re.sub(r'\s+',' ',aux[i]) # Collapse consecutive whitespace into a single space
  return aux

In [24]:
frases_limpias = limpiar_corpus(frases) # Clean the sentences with the function defined above
print(frases_limpias) # Display them

['natural language processing nlp is a subfield of linguistics computer science and artificial intelligence concerned with the interactions between computers and human language in particular how to program computers to process and analyze large amounts of natural language data ', 'the goal is a computer capable of understanding the contents of documents including the contextual nuances of the language within them ', 'the technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves ', 'challenges in natural language processing frequently involve speech recognition natural language understanding and natural language generation ', 'natural language processing has its roots in the 1950s ', 'already in 1950 alan turing published an article titled computing machinery and intelligence which proposed what is now called the turing test as a criterion of intelligence though at the time that was not articula

Inspecting the data

In [25]:
indice_aleatorio_frase = random.randrange(len(frases_limpias)) # Pick a random index in [0, len); randrange excludes the upper bound
print(frases_limpias[indice_aleatorio_frase]) # Print the sentence at that index

as an example george lakoff offers a methodology to build natural language processing nlp algorithms through the perspective of cognitive science along with the findings of cognitive linguistics 40 with two defining aspects ties with cognitive linguistics are part of the historical heritage of nlp but they have been less frequently addressed since the statistical turn during the 1990s 


**Building the dictionary**: a mapping of the form word --> number of times that word appears in the corpus
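
For comparison, the same kind of word --> frequency mapping can be sketched with `collections.Counter` from the standard library. The toy sentences are made up, and `str.split()` is used here as a simplification in place of `nltk.word_tokenize`, so this is an illustration of the idea rather than the notebook's method.

```python
from collections import Counter

# Toy corpus of already-cleaned sentences (hypothetical data)
toy_sentences = ['the cat sat', 'the dog sat down']

word_counts = Counter()
for sentence in toy_sentences:
    # str.split() is a simplification standing in for nltk.word_tokenize
    word_counts.update(sentence.split())

print(dict(word_counts))  # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1, 'down': 1}
```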

In [27]:
def crear_diccionario_frecuencias(texto):
  diccionario_frecuencia = {}
  for frase in texto:  # For each sentence
      tokens = nltk.word_tokenize(frase) # Split the sentence into word tokens
      for token in tokens: # For each word
          if token not in diccionario_frecuencia: # First time we see this word
              diccionario_frecuencia[token] = 1 # Add it with a frequency of 1
          else: # The word is already in the dictionary
              diccionario_frecuencia[token] += 1 # Increment its frequency
  return diccionario_frecuencia


In [40]:
diccionario_frecuencias = crear_diccionario_frecuencias(frases_limpias)
print(diccionario_frecuencias)
print('Dictionary size',len(diccionario_frecuencias))

{'natural': 20, 'language': 28, 'processing': 16, 'nlp': 17, 'is': 15, 'a': 25, 'subfield': 1, 'of': 68, 'linguistics': 9, 'computer': 4, 'science': 3, 'and': 30, 'artificial': 2, 'intelligence': 4, 'concerned': 1, 'with': 10, 'the': 69, 'interactions': 1, 'between': 1, 'computers': 2, 'human': 2, 'in': 29, 'particular': 1, 'how': 2, 'to': 29, 'program': 1, 'process': 2, 'analyze': 2, 'large': 3, 'amounts': 1, 'data': 5, 'goal': 1, 'capable': 1, 'understanding': 4, 'contents': 1, 'documents': 4, 'including': 1, 'contextual': 1, 'nuances': 1, 'within': 1, 'them': 1, 'technology': 1, 'can': 5, 'then': 2, 'accurately': 1, 'extract': 1, 'information': 1, 'insights': 1, 'contained': 1, 'as': 16, 'well': 2, 'categorize': 1, 'organize': 1, 'themselves': 1, 'challenges': 1, 'frequently': 2, 'involve': 2, 'speech': 5, 'recognition': 2, 'generation': 2, 'has': 6, 'its': 2, 'roots': 1, '1950s': 1, 'already': 1, '1950': 1, 'alan': 1, 'turing': 2, 'published': 1, 'an': 5, 'article': 1, 'titled': 1,

Now we manually build a vector representation of the sentences: for each word in the list of most frequent words, we append a 1 if that word occurs in the sentence and a 0 if it does not.
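
A small worked example may make the encoding concrete. The frequencies and sentences below are hypothetical, the vocabulary is limited to 3 words, and `str.split()` stands in for `nltk.word_tokenize`.

```python
import heapq

# Hypothetical word frequencies and cleaned sentences, for illustration only
toy_frequencies = {'the': 4, 'cat': 3, 'sat': 2, 'dog': 1}
toy_sentences = ['the cat sat', 'the dog', 'a bird']

# Keep the 3 most frequent words as the vocabulary
vocabulary = heapq.nlargest(3, toy_frequencies, key=toy_frequencies.get)

vectors = []
for sentence in toy_sentences:
    words = set(sentence.split())  # str.split() stands in for nltk.word_tokenize
    # One position per vocabulary word: 1 if present in the sentence, else 0
    vectors.append([1 if word in words else 0 for word in vocabulary])

print(vocabulary)  # ['the', 'cat', 'sat']
print(vectors)     # [[1, 1, 1], [1, 0, 0], [0, 0, 0]]
```

Every vector has the same length as the vocabulary, regardless of sentence length, and a sentence with no frequent words (like the last one) becomes all zeros.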

In [53]:
import heapq

def codificacion_frases_por_palabras_frecuentes(diccionario,dim=200):
  # Take the `dim` most frequent words from the dictionary passed in
  most_freq = heapq.nlargest(dim, diccionario, key=diccionario.get)
  sentence_vectors = []  # One binary vector per sentence
  for sentence in frases_limpias:  # For each sentence (reads the global frases_limpias)
      sentence_tokens = nltk.word_tokenize(sentence) # Tokenize into words
      sent_vec = [] # Binary vector for this sentence
      for token in most_freq: # For each of the most frequent words
          if token in sentence_tokens: # If it appears in the sentence
              sent_vec.append(1) # Append a 1 to the sentence vector
          else:
              sent_vec.append(0) # Otherwise append a 0
      sentence_vectors.append(sent_vec) # Add the sentence vector to the list
  return most_freq,sentence_vectors

In [55]:
palabras_mas_frecuentes,frases_codificadas = codificacion_frases_por_palabras_frecuentes(diccionario_frecuencias,200)
print(palabras_mas_frecuentes)
print(frases_codificadas)

['the', 'of', 'and', 'in', 'to', 'language', 'a', 'natural', 'nlp', 'processing', 'as', 'is', 'that', 'machine', 'learning', 'cognitive', 'statistical', 'for', 'with', 'e', 'tasks', 'on', 'such', 'linguistics', 'rules', 'g', 'neural', 'have', 'are', 'models', 'based', 'more', 'has', 'by', 'systems', 'algorithms', 'methods', 'many', 'research', 'been', 'input', 'data', 'can', 'speech', 'an', 'which', 'was', 'however', 'be', 'real', 'they', 'computer', 'intelligence', 'understanding', 'documents', 'from', 'symbolic', 'given', 'or', 'hand', 'grammar', 'results', 'increasingly', 'when', 'used', 'since', 'some', 'science', 'large', 'proposed', 'task', '1980s', 'most', 'this', 'computational', 'deep', 'network', 'set', 'commonly', 'through', 'world', 'valued', 'networks', 'larger', 'part', 'use', 'approaches', 'translation', 'trends', 'aspects', 'artificial', 'computers', 'human', 'how', 'process', 'analyze', 'then', 'well', 'frequently', 'involve', 'recognition', 'generation', 'its', 'turin

Converting it to a matrix

In [57]:
frases_codificadas_matrix = np.asarray(frases_codificadas)  
print(frases_codificadas_matrix)

[[1 1 1 ... 0 0 0]
 [1 1 0 ... 1 1 0]
 [1 0 1 ... 0 0 1]
 ...
 [1 1 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
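
In matrix form, each row is a sentence and each column is one of the frequent words, so simple aggregate questions become one-line array operations. The sketch below uses a small hypothetical matrix rather than the notebook's data.

```python
import numpy as np

# Hypothetical encoded sentences: 3 sentences x 4 vocabulary words
toy_vectors = [[1, 1, 0, 0],
               [1, 0, 1, 0],
               [0, 0, 0, 1]]

toy_matrix = np.asarray(toy_vectors)

print(toy_matrix.shape)        # (3, 4): rows are sentences, columns are words
print(toy_matrix.sum(axis=0))  # sentences containing each word: [2 1 1 1]
print(toy_matrix.sum(axis=1))  # vocabulary words present in each sentence: [2 2 1]
```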


In [60]:
import pandas as pd
df = pd.DataFrame(frases_codificadas_matrix)
df.columns = palabras_mas_frecuentes
df

Unnamed: 0,the,of,and,in,to,language,a,natural,nlp,processing,...,amounts,goal,capable,contents,including,contextual,nuances,within,them,technology
0,1,1,1,1,1,1,1,1,1,1,...,1,0,0,0,0,0,0,0,0,0
1,1,1,0,0,0,1,1,0,0,0,...,0,1,1,1,1,1,1,1,1,0
2,1,0,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0,0,1,1,0,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,1,0,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
5,1,1,1,1,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,1,1,1,0,0,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
7,1,1,1,0,1,1,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
8,1,1,0,0,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
9,1,1,0,1,0,1,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0
