# 12.- Natural Language Processing

### 12.0.1.- Installing TextBlob

First, install TextBlob from the Anaconda Prompt:

conda install -c conda-forge textblob

Once that finishes, run the following command to download the corpora TextBlob needs:

ipython -m textblob.download_corpora

## 12.2.- TextBlob

### 12.2.2.- Creating a TextBlob

Let's create a TextBlob from the text 'Y cuando despertó. Todo a su alrededor era luminoso'

In [1]:
from textblob import TextBlob

In [2]:
texto = 'Y cuando despertó. Todo a su alrededor era luminoso'

In [3]:
blob = TextBlob(texto)  # behaves much like a string

In [4]:
blob

TextBlob("Y cuando despertó. Todo a su alrededor era luminoso")

### 12.2.3.- Tokenizing

Now let's get a list of sentences

In [10]:
#import nltk
#nltk.download("punkt_tab")

blob.sentences

[Sentence("Y cuando despertó."), Sentence("Todo a su alrededor era luminoso")]

And a list of words

In [11]:
blob.words

WordList(['Y', 'cuando', 'despertó', 'Todo', 'a', 'su', 'alrededor', 'era', 'luminoso'])

We repeat this for 'I have good discipline to study. In the future I will be a great engineer'

In [13]:
blob2 = TextBlob('I have good discipline to study. In the future I will be a great engineer')
blob2.sentences

[Sentence("I have good discipline to study."),
 Sentence("In the future I will be a great engineer")]

In [14]:
blob2.words

WordList(['I', 'have', 'good', 'discipline', 'to', 'study', 'In', 'the', 'future', 'I', 'will', 'be', 'a', 'great', 'engineer'])
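
To make the idea of tokenizing concrete, here is a minimal standard-library sketch. It is only a crude regex approximation, assuming sentences end in `.`, `!` or `?` followed by whitespace; it is not how TextBlob tokenizes internally, and the helper names `split_sentences` and `split_words` are made up for this illustration.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (a crude rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(text):
    """Return runs of word characters, dropping punctuation."""
    return re.findall(r"\w+", text)

sample = "I have good discipline to study. In the future I will be a great engineer"
sentences = split_sentences(sample)
words = split_words(sample)

print(sentences)
print(len(words), words[:6])
```

A rule this simple breaks on abbreviations such as "Dr." or "e.g.", which is one reason to prefer a trained tokenizer for real text.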

### 12.2.4.- Part-of-speech tagging

Now let's tag each word of the sentence with its category: noun, pronoun, verb, etc.

In [17]:
#import nltk
#nltk.download('averaged_perceptron_tagger_eng')

blob2.tags?


Type:        list
String form: [('I', 'PRP'), ('have', 'VBP'), ('good', 'JJ'), ('discipline', 'NN'), ('to', 'TO'), ('study', 'VB <...> '), ('I', 'PRP'), ('will', 'MD'), ('be', 'VB'), ('a', 'DT'), ('great', 'JJ'), ('engineer', 'NN')]
Length:      15
Docstring:
Built-in mutable sequence.

If no argument is given, the constructor creates a new empty list.
The argument must be an iterable if specified.

### 12.2.5.- Noun phrases

We can ask our TextBlob for its noun phrases

In [18]:
blob2.noun_phrases

WordList(['good discipline', 'great engineer'])

### 12.2.6.- Sentiment analysis

We can also analyze the sentiment of a text. Let's try it with the sentence we have been using.

Now another example: 'I am a bad student. I will not be able to finish my studies'

In [46]:
texto ="I am a bad student. I will not be able to finish my studies"
blob3 = TextBlob(texto)

In [47]:
blob3.sentiment
# polarity ranges from -1.0 (negative) to 1.0 (positive)
# subjectivity ranges from 0.0 (objective) to 1.0 (subjective); this score covers both sentences together

Sentiment(polarity=-0.09999999999999992, subjectivity=0.6458333333333333)
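
One way to see why a clearly negative sentence can yield a near-zero polarity for the whole text is a toy scorer that averages the scores of the words it recognizes. The sketch below assumes a hypothetical lexicon (`TOY_LEXICON`) and plain averaging, and it ignores negation such as "not". It is an illustration only, not TextBlob's analyzer.

```python
from statistics import mean

# Hypothetical scores in [-1.0, 1.0]; these are assumptions, not TextBlob's lexicon.
TOY_LEXICON = {"bad": -0.7, "good": 0.7, "great": 0.8, "able": 0.5}

def toy_polarity(text):
    """Mean score of the words found in the lexicon, or 0.0 if none match."""
    words = text.lower().replace(".", " ").split()
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return mean(scores) if scores else 0.0

whole = toy_polarity("I am a bad student. I will not be able to finish my studies")
first = toy_polarity("I am a bad student.")
second = toy_polarity("I will not be able to finish my studies")

print(whole, first, second)
```

Scoring sentence by sentence keeps the strongly negative first sentence from being diluted by the second.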

And we repeat it for the same example, sentence by sentence

In [48]:
[s.sentiment for s in blob3.sentences]  # each Sentence carries its own sentiment

[Sentiment(polarity=-0.6999999999999998, subjectivity=0.6666666666666666),
 Sentiment(polarity=0.5, subjectivity=0.625)]

The same can be done with NaiveBayesAnalyzer (instead of the default PatternAnalyzer)

In [51]:
from textblob.sentiments import NaiveBayesAnalyzer

In [52]:
blob3 = TextBlob(texto, analyzer = NaiveBayesAnalyzer())

In [53]:
blob3

TextBlob("I am a bad student. I will not be able to finish my studies")

In [54]:
blob3.sentiment

Sentiment(classification='pos', p_pos=0.9111144935259595, p_neg=0.08888550647403913)

## 12.3.- Language detection and translation

Write a greeting such as "Bonjour" (the cells below use "hola") and determine its language using TextBlob

In [65]:
texto = "hola"

In [66]:
blob = TextBlob(texto)

In [63]:
blob.detect_language()  # calls the Google Translate web API, so it needs an internet connection; deprecated since TextBlob 0.16.0

HTTPError: HTTP Error 400: Bad Request

In [59]:
!pip install langdetect
from langdetect import detect




In [67]:
print(detect(texto))

cy


Now use TextBlob to translate 'I have good discipline to study. In the future I will be a great engineer'

In [68]:
blob = TextBlob("I have good discipline to study. In the future I will be a great engineer")

In [69]:
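# Note: passing the blob positionally makes it the from_lang argument, so the whole
# sentence lands in the request URL. The intended call is blob.translate(from_lang="en", to="es").
mifrese = blob.translate(blob)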

InvalidURL: URL can't contain control characters. '/translate_a/t?client=webapp&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&dt=at&ie=UTF-8&oe=UTF-8&otf=2&ssel=0&tsel=0&kc=1&sl=I have good discipline to study. In the future I will be a great engineer&tl=en&hl=en&tk=787765.669259' (found at least ' ')

In [70]:
!pip install deep-translator
from deep_translator import GoogleTranslator



In [72]:
texto = "I have good discipline to study. In the future I will be a great engineer"
traduccion = GoogleTranslator(source = "en", target="es").translate(texto)

In [73]:
print(traduccion)

Tengo buena disciplina para estudiar. En el futuro seré un gran ingeniero.


### 12.3.1.- Inflection - pluralization and singularization

TextBlob also understands singulars and plurals. Get the plural of party and the singular of lives.

In [74]:
from textblob import Word

Pluralize potato, tomato, and carrot

In [75]:
vegetal = Word("potato")
vegetal.pluralize()

'potatoes'

In [76]:
vegetal = Word("Tomato")
vegetal.pluralize()

'Tomatoes'

In [83]:
vegetal = Word("carrot")
vegetal.pluralize()

'carrots'

### 12.3.2.- Spell Check

TextBlob can also check spelling. See what it suggests for a misspelled word such as whife (the cells below use nife)

In [85]:
palabra = Word("nife")

In [86]:
palabra.spellcheck()

[('life', 0.6231369765791341),
 ('wife', 0.2604684173172463),
 ('nine', 0.0454222853087296),
 ('nice', 0.03761533002129169),
 ('knife', 0.029808374733853796),
 ('nile', 0.0021291696238466998),
 ('rife', 0.0014194464158978)]
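
TextBlob's documentation describes its spelling correction as based on Peter Norvig's "How to Write a Spelling Corrector". The sketch below shows the core of that idea under simplifying assumptions: it only considers candidates one edit away, and it ranks them with a tiny hypothetical frequency table (`TOY_COUNTS`) instead of a real corpus. The function names are made up for this illustration.

```python
from collections import Counter
import string

# Hypothetical word counts standing in for a real frequency corpus.
TOY_COUNTS = Counter({"life": 60, "wife": 25, "knife": 10, "nice": 5})

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def toy_spellcheck(word, counts=TOY_COUNTS):
    """Rank known candidates one edit away by their relative frequency."""
    if word in counts:
        return [(word, 1.0)]
    candidates = [w for w in edits1(word) if w in counts]
    total = sum(counts[w] for w in candidates)
    ranked = sorted(candidates, key=lambda w: counts[w], reverse=True)
    return [(w, counts[w] / total) for w in ranked]

suggestions = toy_spellcheck("nife")
print(suggestions)
```

As in the output above, each suggestion is paired with a confidence, and the confidences for one word add up to 1.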

You can also check the spelling of a whole sentence, such as 'Yestarday was a bab dai'

In [88]:
fras = TextBlob("Yestarday was a bab dai")

In [89]:
fras.correct()

TextBlob("Yesterday was a bad day")


In [92]:
fras = TextBlob("nowadais wi wont kave gud gaimes")

In [93]:
fras.correct()

TextBlob("nowadays i wont have god games")

### 12.3.3.- Normalization

You can obtain the stem and lemma of a word, for example a plural such as dormitories. The cell below singularizes men and lemmatizes running as a verb.

In [97]:
w1 = Word("men")
print("Word", w1)
print("Singularize", w1.singularize())

w2 = Word("running")
print("\nWord", w2)
print("Lemmatize", w2.lemmatize("v"))  # "v" tells the lemmatizer to treat the word as a verb


Word men
Singularize man

Word running
Lemmatize run

### 12.3.4.- Word frequencies

What if you want to know how often a word appears?

Load the text of dracula.txt and count how many times a word such as crucifix, dracula, or blood appears (the cell below counts "the")

In [98]:
from pathlib import Path

In [100]:
texto = Path("dracula.txt").read_text(encoding="utf-8")  # reads the file and closes it

In [102]:
libro_drac = TextBlob(texto)

In [140]:
libro_drac.words.count("the")

8035
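
For the three words named in the exercise, the same kind of count can be sketched with the standard library alone. This is a minimal example, assuming a short made-up passage in place of the real dracula.txt and a simple regex tokenizer. Lowercasing first makes the count case-insensitive.

```python
from collections import Counter
import re

# Short made-up passage (hypothetical); the notebook would use the contents of dracula.txt.
sample = ("The blood is the life. Dracula pushed the crucifix away, "
          "and the blood ran cold.")

tokens = re.findall(r"[a-z']+", sample.lower())
counts = Counter(tokens)

for word in ("crucifix", "dracula", "blood"):
    print(word, counts[word])
```

A `Counter` returns 0 for words that never appear, so no extra check is needed before looking one up.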

Or you can count specific phrases with the count method on text that is already tokenized. Try it with the phrase "lady capulet"
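
Counting a multi-word phrase over tokenized text amounts to sliding a window of the phrase's length across the tokens. The sketch below is a standard-library illustration, assuming a made-up sample sentence and a hypothetical helper `count_phrase`. It is not TextBlob's own implementation.

```python
import re

def count_phrase(text, phrase):
    """Count occurrences of a multi-word phrase by sliding a window over the tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    target = phrase.lower().split()
    if not target:
        return 0
    n = len(target)
    return sum(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))

# Made-up sample text (hypothetical), used only to exercise the helper.
sample = "Lady Capulet enters. The nurse calls Lady Capulet, and Lady Capulet answers."
print(count_phrase(sample, "lady capulet"))
```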

### 12.3.5.- Definitions, synonyms, and antonyms

You can look up definitions from TextBlob

Try it with the definition of "worker" (the cells below look up "communist")

In [122]:
prueba = Word("communist")

In [123]:
prueba.definitions

['a member of the communist party',
 'a socialist who advocates communism',
 'relating to or marked by communism']

Or synonyms as well, via synsets

In [124]:
prueba.synsets

[Synset('communist.n.01'), Synset('communist.n.02'), Synset('communist.a.01')]

### 12.3.6.- Stop Words

Stop words are words that usually contribute no useful information to a machine learning analysis.

They have to be loaded from NLTK

In [125]:
import nltk

In [126]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Techie3\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [128]:
from nltk.corpus import stopwords

In [129]:
stops = stopwords.words("english")

In [130]:
print(stops)

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

Let's create a text such as 'I have a beautiful day'

In [169]:
blob = TextBlob("I have a beautiful day")

And remove its stop words

In [170]:
palabras = [i for i in blob.words if i.lower() not in stops]  # keep only the non-stop words

In [174]:
blob = TextBlob(" ".join(palabras))  # join the words back into plain text
blob

TextBlob("beautiful day")

## 12.4.- Visualizing word frequencies

Let's build a word cloud of Dracula. We start by loading Dracula again.


In [None]:
import spacy

In [None]:
nlp = spacy.load("en_core_web_sm")

Now we load the English stop words

We get the word frequencies

Now we remove the stop words

We sort the remaining words by frequency

We get the top 20 words
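
The steps above (count, drop stop words, sort, keep the top N) can be sketched with the standard library. This is a minimal example under stated assumptions: `TOY_STOPS` is a tiny hypothetical stop-word set standing in for NLTK's English list, `sample` is a made-up passage standing in for dracula.txt, and `top_words` is a helper name invented here.

```python
from collections import Counter
import re

# Tiny hypothetical stop-word set; the notebook would use stopwords.words("english").
TOY_STOPS = {"the", "a", "and", "of", "to", "in", "is", "was", "i"}

def top_words(text, stops, n=20):
    """Return the n most frequent non-stop words as (word, count) pairs."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in stops)
    return counts.most_common(n)  # already sorted by descending count

# Made-up passage (hypothetical) used in place of the real book.
sample = "The count was in the castle. The castle was dark and the count was gone."
top = top_words(sample, TOY_STOPS, n=3)
print(top)
```

`Counter.most_common` does the sorting and the top-N cut in one call, so the two steps do not need separate code.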

Then we convert the top 20 into a DataFrame

And we plot the DataFrame as a simple bar chart

### 12.4.1.- Word Cloud

Now we need to install the WordCloud module

We load the libraries we need

We load the Dracula text and the stop words

We create a mask for the cloud using the imread function

OK, now we set some specific characteristics of the cloud to create

Then we apply the method that generates the word cloud

And save it as an image

## 12.5.- Named Entity Recognition with spaCy

Install spaCy from the Anaconda Prompt

Load the language model

In [176]:
import spacy

In [177]:
nlp = spacy.load("en_core_web_sm")

Create a spaCy document with the text airbnb is an American company that operates an online marketplace for lodging, primarily homestays for vacation rentals, and tourism activities. Based in San Francisco, California, the platform is accessible via website and mobile app. Airbnb does not own any of the listed properties; instead, it profits by receiving commission from each booking. The company was founded in 2008 by Brian Chesky, Nathan Blecharczyk and Joe Gebbia. Airbnb is a shortened version of its original name, AirBedandBreakfast.com

Get the named entities. Because the Spanish instruction was pasted into the string passed to `nlp`, the first entity reported below is `el texto airbnb`.

In [178]:
doc= nlp("Crea un documento de spacy con el texto airbnb is an American company that operates an online marketplace for lodging, primarily homestays for vacation rentals, and tourism activities. Based in San Francisco, California, the platform is accessible via website and mobile app. Airbnb does not own any of the listed properties; instead, it profits by receiving commission from each booking. The company was founded in 2008 by Brian Chesky, Nathan Blecharczyk and Joe Gebbia. Airbnb is a shortened version of its original name, AirBedandBreakfast.com")

In [179]:
for entity in doc.ents:  # spaCy can extract proper names from natural-language text
    print(f'{entity.text}:{entity.label_}')

el texto airbnb:ORG
American:NORP
San Francisco:GPE
California:GPE
2008:DATE
Brian Chesky:PERSON
Nathan Blecharczyk:PERSON
Joe Gebbia:PERSON

