# 12.- Natural Language Processing

### 12.0.1.- Installing TextBlob

First, install TextBlob from the Anaconda prompt:

    conda install -c conda-forge textblob

Once that finishes, run the following command to download the corpora TextBlob needs:

    ipython -m textblob.download_corpora

## 12.2.- TextBlob

### 12.2.2.- Creating a TextBlob

Let's create a TextBlob. The exercise suggests the text 'Y cuando despertó. Todo a su alrededor era luminoso'; the cells below use a longer English news excerpt instead.

In [3]:
from textblob import TextBlob

In [8]:
texto = """In new lawsuits brought against the ride-sharing companies Uber and Lyft, the top prosecutors in Los Angeles 
and San Francisco counties make an important point about the lightly regulated sharing economy. The consumers who 
participate deserve a very clear picture of the risks they're taking"""
blob = TextBlob(texto)  # a TextBlob is not the same thing as a plain str
blob

TextBlob("In new lawsuits brought against the ride-sharing companies Uber and Lyft, the top prosecutors in Los Angeles 
and San Francisco counties make an important point about the lightly regulated sharing economy. The consumers who 
participate deserve a very clear picture of the risks they're taking")

In [5]:
print(blob.sentences)

[Sentence("In new lawsuits brought against the ride-sharing companies Uber and Lyft, the top prosecutors in Los Angeles 
and San Francisco counties make an important point about the lightly regulated sharing economy."), Sentence("The consumers who 
participate deserve a very clear picture of the risks they're taking")]


### 12.2.3.- Tokenizing

Now let's get a list of sentences.

In [10]:
import nltk
nltk.download("punkt_tab")

blob.sentences

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Techie10\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


[Sentence("In new lawsuits brought against the ride-sharing companies Uber and Lyft, the top prosecutors in Los Angeles 
 and San Francisco counties make an important point about the lightly regulated sharing economy."),
 Sentence("The consumers who 
 participate deserve a very clear picture of the risks they're taking")]

And a list of words.

In [11]:
blob.words

WordList(['In', 'new', 'lawsuits', 'brought', 'against', 'the', 'ride-sharing', 'companies', 'Uber', 'and', 'Lyft', 'the', 'top', 'prosecutors', 'in', 'Los', 'Angeles', 'and', 'San', 'Francisco', 'counties', 'make', 'an', 'important', 'point', 'about', 'the', 'lightly', 'regulated', 'sharing', 'economy', 'The', 'consumers', 'who', 'participate', 'deserve', 'a', 'very', 'clear', 'picture', 'of', 'the', 'risks', 'they', "'re", 'taking'])

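We repeat this for 'I have good discipline to study. In the future I will be a great engineer'. The cell below omits the period between the two sentences, which makes no difference to the word list because `words` drops punctuation.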

In [12]:
blob2 = TextBlob('I have good discipline to study In the future I will be a great engineer')

In [13]:
blob2.words

WordList(['I', 'have', 'good', 'discipline', 'to', 'study', 'In', 'the', 'future', 'I', 'will', 'be', 'a', 'great', 'engineer'])

### 12.2.4.- Part-of-Speech Tagging

Now let's tag each word of the sentence with its part of speech: noun, pronoun, verb, etc.

In [None]:
#import nltk
#nltk.download("punkt_tab")
blob2.tags


[('I', 'PRP'),
 ('have', 'VBP'),
 ('good', 'JJ'),
 ('discipline', 'NN'),
 ('to', 'TO'),
 ('study', 'VB'),
 ('In', 'IN'),
 ('the', 'DT'),
 ('future', 'NN'),
 ('I', 'PRP'),
 ('will', 'MD'),
 ('be', 'VB'),
 ('a', 'DT'),
 ('great', 'JJ'),
 ('engineer', 'NN')]
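
The tags above follow the Penn Treebank tag set. As a minimal sketch, the list can be copied from the output and paired with a short description of each tag. The `TAG_DESCRIPTIONS` table is a hand-written, partial helper introduced here for illustration; it is not part of TextBlob.

```python
# Tags copied from the blob2.tags output above.
tags = [
    ("I", "PRP"), ("have", "VBP"), ("good", "JJ"), ("discipline", "NN"),
    ("to", "TO"), ("study", "VB"), ("In", "IN"), ("the", "DT"),
    ("future", "NN"), ("I", "PRP"), ("will", "MD"), ("be", "VB"),
    ("a", "DT"), ("great", "JJ"), ("engineer", "NN"),
]

# Partial lookup table (hypothetical helper, only covers the tags seen above).
TAG_DESCRIPTIONS = {
    "PRP": "personal pronoun",
    "VBP": "verb, non-3rd person singular present",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "TO": "to",
    "VB": "verb, base form",
    "IN": "preposition or subordinating conjunction",
    "DT": "determiner",
    "MD": "modal",
}

# Print each word next to its tag and a readable description.
for word, tag in tags:
    print(f"{word:<12}{tag:<5}{TAG_DESCRIPTIONS.get(tag, 'unknown')}")

# Keep only the nouns (every noun tag starts with "NN").
nouns = [word for word, tag in tags if tag.startswith("NN")]
print(nouns)
```

Filtering on the tag prefix rather than on an exact match also picks up plural and proper nouns (`NNS`, `NNP`) if they appear in other texts.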

### 12.2.5.- Noun Phrases

We can ask our TextBlob for its noun phrases.

In [16]:
blob2.noun_phrases

WordList(['good discipline', 'great engineer'])

### 12.2.6.- Sentiment Analysis

We can also analyze the sentiment of a sentence. Let's start from the kind of sentence we have been using.

Now another example: 'I am a bad student. I will not be able to finish my studies'

In [17]:
ejemplo = 'I am a bad student. I will not be able to finish my studies'

In [19]:
blob3 = TextBlob(ejemplo)

In [None]:
blob3.sentiment
# polarity: from -1.0 (negative) to 1.0 (positive)
# subjectivity: from 0.0 (objective) to 1.0 (subjective); it is not a confidence in the polarity
# note: the analyzer struggles when several sentences are scored together

Sentiment(polarity=-0.09999999999999992, subjectivity=0.6458333333333333)

And we repeat it for the same example, this time sentence by sentence.

In [28]:
for sentence in blob3.sentences:
    print(sentence.sentiment)

Sentiment(polarity=-0.6999999999999998, subjectivity=0.6666666666666666)
Sentiment(polarity=0.5, subjectivity=0.625)


In [30]:
[sentence.sentiment for sentence in blob3.sentences]

[Sentiment(polarity=-0.6999999999999998, subjectivity=0.6666666666666666),
 Sentiment(polarity=0.5, subjectivity=0.625)]
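
In this example, the score for the whole text happens to equal the mean of the two sentence scores, which is why a clearly negative sentence and a mildly positive one end up close to neutral. The sketch below only checks that arithmetic with the numbers printed above; it should not be read as a general description of how the analyzer combines sentences.

```python
import math

# Per-sentence (polarity, subjectivity) pairs copied from the output above.
sentence_scores = [
    (-0.6999999999999998, 0.6666666666666666),
    (0.5, 0.625),
]

# Whole-text scores reported earlier for blob3.sentiment.
whole_polarity = -0.09999999999999992
whole_subjectivity = 0.6458333333333333

# Plain arithmetic mean of each column.
mean_polarity = sum(p for p, _ in sentence_scores) / len(sentence_scores)
mean_subjectivity = sum(s for _, s in sentence_scores) / len(sentence_scores)

print(mean_polarity, mean_subjectivity)
print(math.isclose(mean_polarity, whole_polarity))
print(math.isclose(mean_subjectivity, whole_subjectivity))
```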

The same can be done with NaiveBayesAnalyzer (instead of the default pattern-based analyzer).

In [32]:
from textblob.sentiments import NaiveBayesAnalyzer

In [34]:
blob3 = TextBlob(texto, analyzer=NaiveBayesAnalyzer())

In [35]:
blob3

TextBlob("In new lawsuits brought against the ride-sharing companies Uber and Lyft, the top prosecutors in Los Angeles 
and San Francisco counties make an important point about the lightly regulated sharing economy. The consumers who 
participate deserve a very clear picture of the risks they're taking")

In [36]:
blob3.sentiment

Sentiment(classification='pos', p_pos=0.9518646070826409, p_neg=0.04813539291736653)

## 12.3.- Language Detection and Translation

Write "Bonjour" and determine its language using TextBlob.

In [37]:
texto = "Bonjour"

In [39]:
blob = TextBlob(texto)

In [41]:
blob.detect_language()

AttributeError: 'TextBlob' object has no attribute 'detect_language'
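
The error above is what recent TextBlob releases produce. According to the project changelog, `detect_language()` and `translate()` were removed in version 0.18.0; treat the exact version number as an assumption and check it against your own installation. A small sketch for that check, using only the standard library (`installed_version` is a helper name introduced here):

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints something like '0.18.0', or None when TextBlob is not installed.
print(installed_version("textblob"))
```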

In [None]:
#!pip install langdetect

# this TextBlob install has no detect_language(), so langdetect is used instead
from langdetect import detect

print(detect(texto))

Now use TextBlob to translate 'I have good discipline to study. In the future I will be a great engineer'

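In [None]:
blob = TextBlob("I have good discipline to study. In the future I will be a great engineer")

In [None]:
# translate() only exists in older TextBlob releases; newer ones raise AttributeError
miFrase = blob.translate(from_lang="en", to="es")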

In [None]:
#!pip install deep-translator
from deep_translator import GoogleTranslator



In [49]:
texto = "I have good discipline to study In the future I will be a great engineer"
traduccion = GoogleTranslator(source="en", target="es").translate(texto)

In [50]:
print(traduccion)

Tengo buena disciplina para estudiar En el futuro seré un gran ingeniero


### 12.3.1.- Inflection - Pluralization and Singularization

TextBlob also understands singulars and plurals. Get the plural of "party" and the singular of "lives".

In [51]:
from textblob import Word

Pluralize "potato", "tomato" and "carrot". The last cell below tries the Spanish word "arbol" instead of "carrot".

In [52]:
vegetal = Word("potato")
vegetal.pluralize()

'potatoes'

In [53]:
vegetal = Word("tomato")
vegetal.pluralize()

'tomatoes'

In [56]:
vegetal = Word("arbol")
vegetal.pluralize()

'arbols'
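
The result 'arbols' suggests that English rules are applied regardless of the word's language. The sketch below is a deliberately tiny, English-only set of suffix rules that reproduces the outputs above. `naive_pluralize` and its rule list are hypothetical and much smaller than what TextBlob actually uses.

```python
def naive_pluralize(word):
    """Pluralize with a few English suffix rules (illustrative, not TextBlob's rules)."""
    vowels = "aeiou"
    # consonant + y -> ies (party -> parties)
    if word.endswith("y") and len(word) > 1 and word[-2] not in vowels:
        return word[:-1] + "ies"
    # sibilant endings -> es (bus -> buses, box -> boxes)
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    # a few common nouns ending in consonant + o -> oes
    if word in {"potato", "tomato", "hero", "echo"}:
        return word + "es"
    # default rule: just append s, which is what happens to "arbol"
    return word + "s"


for w in ["potato", "tomato", "carrot", "party", "arbol"]:
    print(w, "->", naive_pluralize(w))
```

The correct Spanish plural is "árboles", so a rule set like this one (or TextBlob's English one) should only be trusted for English words.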

### 12.3.2.- Spell Check

TextBlob can also check spelling. See what it suggests for a misspelled word such as "whife"; the cell below uses "nife".

In [57]:
palabra = Word("nife")

In [59]:
palabra.spellcheck()

[('life', 0.6231369765791341),
 ('wife', 0.2604684173172463),
 ('nine', 0.0454222853087296),
 ('nice', 0.03761533002129169),
 ('knife', 0.029808374733853796),
 ('nile', 0.0021291696238466998),
 ('rife', 0.0014194464158978)]
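
The TextBlob documentation describes its spelling correction as based on Peter Norvig's "How to Write a Spelling Corrector". The sketch below shows the general idea on a toy scale: generate every string one edit away from the input, keep the known words, and rank them by their share of the total count. `WORD_COUNTS` is a made-up frequency table, so the numbers will not match the output above.

```python
import string
from collections import Counter

# Hypothetical toy word counts; a real corrector learns them from a large corpus.
WORD_COUNTS = Counter(
    {"life": 60, "wife": 25, "nine": 5, "nice": 4, "knife": 3, "night": 3}
)


def edits1(word):
    """All strings one edit away: deletes, transposes, replaces and inserts."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def suggest(word):
    """Return (candidate, share) pairs, most frequent candidate first."""
    if word in WORD_COUNTS:
        return [(word, 1.0)]
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    if not candidates:
        return [(word, 0.0)]
    total = sum(WORD_COUNTS[w] for w in candidates)
    ranked = sorted(candidates, key=lambda w: WORD_COUNTS[w], reverse=True)
    return [(w, WORD_COUNTS[w] / total) for w in ranked]


print(suggest("nife"))
```

Note that "night" is in the table but is not suggested, because it is more than one edit away from "nife".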

You can also correct the spelling of a whole sentence, such as 'Yestarday was a bab dai'.

In [60]:
enunciado = TextBlob("Yestarday was a bab dai")

In [61]:
enunciado.correct()

TextBlob("Yesterday was a bad day")

In [63]:
enunciado = TextBlob('nawadais wi dont ja gud gaimes')

In [64]:
enunciado.correct()

TextBlob("nowadays i dont ja god games")

### 12.3.3.- Normalization

You can get the stem and lemma of a plural word such as "dormitories". The cell below singularizes "men" and lemmatizes "running" as a verb.

In [66]:
w1 = Word("men")
print("Word:", w1)
print("Singularize:", w1.singularize())

w2 = Word("running")
print("\nWord:", w2)
print("Lemmatize:", w2.lemmatize("v"))  # "v" = treat the word as a verb

Word: men
Singularize: man

Word: running
Lemmatize: run


### 12.3.4.- Word Frequencies

What if you want to know how often a word appears?

Load the text of dracula.txt and count how many times the words "crucifix", "dracula" and "blood" appear.

In [67]:
from pathlib import Path

In [68]:
texto = Path("dracula.txt").read_text(encoding="utf-8")  # read the whole book; the file is closed automatically

In [69]:
libro_drac = TextBlob(texto)

In [75]:
libro_drac.words.count("a")

2978

You can also count specific phrases with the count method on text that has already been tokenized; try it with the phrase "lady capulet".
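
A minimal standard-library sketch of both kinds of counting, assuming a short made-up sample instead of a real book and a crude regex tokenizer instead of TextBlob's:

```python
import re
from collections import Counter

# Hypothetical sample text (not taken from any file used in this chapter).
text = "Enter Lady Capulet. Lady Capulet speaks, and the lady leaves."

# Lowercase word tokens, so that "Lady" and "lady" are counted together.
tokens = re.findall(r"[a-z']+", text.lower())

# Single-word frequency.
word_counts = Counter(tokens)
print(word_counts["lady"])

# Two-word phrase frequency: count pairs of adjacent tokens.
bigram_counts = Counter(zip(tokens, tokens[1:]))
print(bigram_counts[("lady", "capulet")])
```

Because the pairs are built from one flat token list, a phrase can also be matched across a sentence boundary; split into sentences first if that matters.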

### 12.3.5.- Definitions, Synonyms and Antonyms

You can look up definitions from TextBlob.

Try it with the definition of "worker".

In [77]:
prueba=Word("worker")

In [78]:
prueba.definitions

['a person who works at a specific occupation',
 'a member of the working class (not necessarily employed)',
 'sterile member of a colony of social insects that forages for food and cares for the larvae',
 'a person who acts and gets things done']

Or synonyms as well, using synsets.

In [79]:
prueba.synsets

[Synset('worker.n.01'),
 Synset('proletarian.n.01'),
 Synset('worker.n.03'),
 Synset('actor.n.02')]

### 12.3.6.- Stop Words

Stop words are words that usually contribute no useful information to a machine learning analysis.

They have to be loaded from NLTK.

In [None]:
import nltk

In [81]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Techie10\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [82]:
from nltk.corpus import stopwords

In [83]:
stops = stopwords.words("english")

In [84]:
print(stops)

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

Let's create a text such as "I have a beautiful day".

In [99]:
blob = TextBlob("I have a beautiful day")

And remove its stop words.

In [95]:
vacio = []
for palabra in blob.words:
    # case-sensitive check: 'I' is kept because the list only contains 'i'
    if palabra not in stops:
        vacio.append(palabra)

print(vacio)

['I', 'beautiful', 'day']


In [100]:
[pal for pal in blob.words if pal.lower() not in stops]

['beautiful', 'day']

## 12.4.- Visualizing Word Frequencies

We are going to build a word cloud of Dracula. Let's start by loading dracula.txt again.

Now load the English stop words.

Get the word frequencies.

Now remove the stop words.

Sort the remaining words by frequency.

Take the top 20 words.
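
The steps so far can be sketched with the standard library alone. The sample text and the tiny stop-word set below are made-up stand-ins for dracula.txt and the NLTK list, and `top_n` is set to 3 only because the sample is so short.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the book text and the NLTK stop-word list.
sample_text = (
    "The blood is the life. The Count looked at the crucifix, "
    "and the Count feared the crucifix more than the blood."
)
stop_words = {"the", "is", "at", "and", "more", "than"}

# 1. Word frequencies (lowercased so that 'The' and 'the' are merged).
frequencies = Counter(re.findall(r"[a-z']+", sample_text.lower()))

# 2. Drop the stop words.
filtered = {word: count for word, count in frequencies.items() if word not in stop_words}

# 3. Sort by frequency, highest first, and 4. keep the top N.
top_n = 3
top_words = sorted(filtered.items(), key=lambda item: item[1], reverse=True)[:top_n]
print(top_words)
```

Python's sort is stable, so words with the same count keep the order in which they first appeared in the text.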

Then convert the top 20 into a DataFrame.

And plot the DataFrame as a simple bar chart.

### 12.4.1.- Word Cloud

Now we need to install the WordCloud module.

Load the libraries we need.

Load the Dracula text and the stop words.

Create a mask for the cloud using the imread function.

Next, set some specific properties of the cloud to be created.

Then apply the method that generates the word cloud.

And save it as an image.

## 12.5.- Named Entity Recognition with spaCy

Install spaCy from the prompt.

Load the language model.

In [101]:
import spacy

ModuleNotFoundError: No module named 'spacy'

In [None]:
nlp = spacy.load("en_core_web_sm")

Create a spaCy document with the text "airbnb is an American company that operates an online marketplace for lodging, primarily homestays for vacation rentals, and tourism activities. Based in San Francisco, California, the platform is accessible via website and mobile app. Airbnb does not own any of the listed properties; instead, it profits by receiving commission from each booking. The company was founded in 2008 by Brian Chesky, Nathan Blecharczyk and Joe Gebbia. Airbnb is a shortened version of its original name, AirBedandBreakfast.com"

Get the named entities.

In [None]:
documento = nlp("airbnb is an American company that operates an online marketplace for lodging, primarily homestays for vacation rentals, and tourism activities. Based in San Francisco, California, the platform is accessible via website and mobile app. Airbnb does not own any of the listed properties; instead, it profits by receiving commission from each booking. The company was founded in 2008 by Brian Chesky, Nathan Blecharczyk and Joe Gebbia. Airbnb is a shortened version of its original name, AirBedandBreakfast.com")

In [None]:
for entity in documento.ents:  # spaCy can pull proper names out of natural-language text
    print(f'{entity.text}: {entity.label_}')  # text and label are separate fields, not a format spec
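
In an f-string, whatever follows a colon inside the braces is read as a format specification, so an entity's text and label must go in separate fields. The sketch below shows this with plain tuples, so it runs without spaCy. The `(text, label)` pairs are hypothetical examples and not actual model output.

```python
# Hypothetical (text, label) pairs standing in for documento.ents.
entities = [("San Francisco", "GPE"), ("2008", "DATE"), ("Brian Chesky", "PERSON")]

# Separate fields: the text is left-aligned in 16 columns, then the label.
lines = [f"{text:<16}{label}" for text, label in entities]
for line in lines:
    print(line)

# Using the label as a format specification fails for ordinary label strings.
text, label = "Airbnb", "ORG"
try:
    f"{text:{label}}"
    raised = False
except ValueError as error:
    raised = True
    print("ValueError:", error)
```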