# Example 6: NLTK Text

__Objectives__

- Learn how to use the `Text` object from the NLTK library

__Walkthrough__

We will use the NLTK library to explore some basic natural language processing techniques. Many of these procedures are typically used to prepare data for training a model or for building a visualization.

---

In [1]:
import pandas as pd
import nltk
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [2]:
df = pd.read_json('../../datasets/new_york_times_bestsellers-clean.json')

df.head()

Unnamed: 0,amazon_product_url,author,description,publisher,title,oid,bestsellers_date.numberLong,published_date.numberLong,rank.numberInt,rank_last_week.numberInt,weeks_on_list.numberInt,price.numberDouble
0,http://www.amazon.com/The-Host-Novel-Stephenie...,Stephenie Meyer,Aliens have taken control of the minds and bod...,"Little, Brown",THE HOST,5b4aa4ead3089013507db18c,1211587200000,1212883200000,2,1,3,25.99
1,http://www.amazon.com/Love-Youre-With-Emily-Gi...,Emily Giffin,A woman's happy marriage is shaken when she en...,St. Martin's,LOVE THE ONE YOU'RE WITH,5b4aa4ead3089013507db18d,1211587200000,1212883200000,3,2,2,24.95
2,http://www.amazon.com/The-Front-Garano-Patrici...,Patricia Cornwell,A Massachusetts state investigator and his tea...,Putnam,THE FRONT,5b4aa4ead3089013507db18e,1211587200000,1212883200000,4,0,1,22.95
3,http://www.amazon.com/Snuff-Chuck-Palahniuk/dp...,Chuck Palahniuk,An aging porn queens aims to cap her career by...,Doubleday,SNUFF,5b4aa4ead3089013507db18f,1211587200000,1212883200000,5,0,1,24.95
4,http://www.amazon.com/Sundays-at-Tiffanys-Jame...,James Patterson and Gabrielle Charbonnet,A woman finds an unexpected love,"Little, Brown",SUNDAYS AT TIFFANY’S,5b4aa4ead3089013507db190,1211587200000,1212883200000,6,3,4,24.99


In [3]:
grouped_by_title = df.groupby('title')['description'].max()
grouped_by_title

title
10TH ANNIVERSARY            Detective Lindsay Boxer and the Women’s Murder...
11TH HOUR                   Detective Lindsay Boxer and the Women’s Murder...
1225 CHRISTMAS TREE LANE    Puppies and an ex-husband loom large in the la...
1356                        In the fourth book of the Grail Quest series, ...
1Q84                        In 1980s Tokyo, a woman who punishes perpetrat...
                                                  ...                        
Z                           A novel based on the lives of Zelda and F. Sco...
ZERO DAY                       A military investigator uncovers a conspiracy.
ZERO HISTORY                Several characters from “Spook Country” return...
ZONE ONE                      Fighting zombies in post-apocalyptic Manhattan.
ZOO                         A young biologist warns world leaders about th...
Name: description, Length: 754, dtype: object

First we need to clean up our texts a bit:

In [4]:

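# Raw strings avoid invalid-escape warnings. Note: in pandas >= 2.0, str.replace
# defaults to regex=False, so these patterns are matched literally unless
# regex=True is passed; the output below still shows punctuation and digits.
grouped_by_title = grouped_by_title.str.lower()
grouped_by_title = grouped_by_title.str.strip()
grouped_by_title = grouped_by_title.str.replace(r'[^\w\s]', '')  # punctuation
grouped_by_title = grouped_by_title.str.replace(r'\d', '')       # digits
grouped_by_title = grouped_by_title.str.replace(r'\n', '')       # literal backslash-n sequences
grouped_by_title = grouped_by_title.dropna()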

grouped_by_title

title
10TH ANNIVERSARY            detective lindsay boxer and the women’s murder...
11TH HOUR                   detective lindsay boxer and the women’s murder...
1225 CHRISTMAS TREE LANE    puppies and an ex-husband loom large in the la...
1356                        in the fourth book of the grail quest series, ...
1Q84                        in 1980s tokyo, a woman who punishes perpetrat...
                                                  ...                        
Z                           a novel based on the lives of zelda and f. sco...
ZERO DAY                       a military investigator uncovers a conspiracy.
ZERO HISTORY                several characters from “spook country” return...
ZONE ONE                      fighting zombies in post-apocalyptic manhattan.
ZOO                         a young biologist warns world leaders about th...
Name: description, Length: 754, dtype: object
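
The punctuation and digits still visible in the output above are consistent with the patterns having been matched literally rather than as regular expressions. As a minimal sketch of what the same patterns do when they are interpreted as regular expressions, here is the cleaning sequence applied to a single string with the standard `re` module. The `clean` helper and the sample sentence are illustrative assumptions, not part of the original notebook.

```python
import re

def clean(text: str) -> str:
    """Lowercase, trim, and strip punctuation and digits (hypothetical helper)."""
    text = text.lower().strip()
    text = re.sub(r'[^\w\s]', '', text)  # drop anything that is not a word character or whitespace
    text = re.sub(r'\d', '', text)       # drop digits
    return text

sample = "In 1980s Tokyo, a woman who punishes perpetrators..."
print(clean(sample))
```

With pandas, the same effect should be obtainable by passing `regex=True` to each `str.replace` call. Doing so would change the tokens and counts shown in the rest of this example, and stripping digits leaves fragments behind (here `1980s` becomes `s`).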

Next, we build a `Text` object from our data.

Prerequisite:

````
nltk.download('punkt')
````


In [5]:
# Split each description into word tokens
tokenized = grouped_by_title.apply(nltk.word_tokenize)
tokenized

title
10TH ANNIVERSARY            [detective, lindsay, boxer, and, the, women, ’...
11TH HOUR                   [detective, lindsay, boxer, and, the, women, ’...
1225 CHRISTMAS TREE LANE    [puppies, and, an, ex-husband, loom, large, in...
1356                        [in, the, fourth, book, of, the, grail, quest,...
1Q84                        [in, 1980s, tokyo, ,, a, woman, who, punishes,...
                                                  ...                        
Z                           [a, novel, based, on, the, lives, of, zelda, a...
ZERO DAY                    [a, military, investigator, uncovers, a, consp...
ZERO HISTORY                [several, characters, from, “, spook, country,...
ZONE ONE                    [fighting, zombies, in, post-apocalyptic, manh...
ZOO                         [a, young, biologist, warns, world, leaders, a...
Name: description, Length: 754, dtype: object

In [6]:
# Concatenate all the token lists into a single list holding every word in the dataset,
# then create a Text object from it

all_words = tokenized.sum()
text = nltk.Text(all_words)

text

<Text: detective lindsay boxer and the women ’ s...>
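
Summing a Series of lists concatenates them one pair at a time, which can get slow as the dataset grows. A minimal alternative sketch, using plain Python lists as a stand-in for the `tokenized` Series (the sample data is made up for illustration):

```python
from itertools import chain

# Stand-in for the `tokenized` Series: one token list per description
token_lists = [
    ["a", "military", "investigator"],
    ["fighting", "zombies"],
    ["a", "young", "biologist"],
]

# Flatten in a single pass instead of repeatedly concatenating lists
all_words = list(chain.from_iterable(token_lists))
print(all_words)
```

`chain.from_iterable` accepts any iterable of lists, so passing the `tokenized` Series directly should work as well; the resulting list can be handed to `nltk.Text` as before.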

`concordance()`: prints a concordance for the given word, that is, every occurrence of the word shown with its surrounding context. Word matching is case-insensitive.

[Reference](https://tedboy.github.io/nlps/generated/generated/nltk.Text.concordance.html?highlight=concordance#nltk.Text.concordance)

In [7]:
# concordance shows each occurrence of the word together with its context
text.concordance('woman', lines=20)

Displaying 20 of 73 matches:
tle of poitiers . in 1980s tokyo , a woman who punishes perpetrators of domesti
 mishandling an autopsy . a pregnant woman shows up in cedar cove on christmas 
r in a room above a stable . a young woman ’ s life is transformed by a mountai
othing is as it seems . a middle-age woman takes a cross-country road trip with
 . a young , beautiful and ambitious woman ruthlessly ascends the heights of th
ng of humans and heavenly beings . a woman in her late 30s marries the man of h
phecy about the end of the world . a woman ’ s life is complicated by the fact 
ichidian universe , a smuggler and a woman warrior must fight together to survi
loosa trilogy , two lawmen protect a woman one of them loves . in french ’ s fo
an arcane society novel . a southern woman is forever changed by the betrayals 
ips ’ s earlier novels reappear as a woman persuades a friend to call off her w
ot to kill thousands of citizens . a woman asks the boston detective d.d . warr
ker in purs
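
To make the idea of a concordance concrete, here is a simplified keyword-in-context sketch in plain Python. It is not NLTK's implementation: the `kwic` function is a hypothetical helper, and it measures context in tokens, whereas the output above is aligned by character width.

```python
def kwic(tokens, word, width=3):
    """Return each occurrence of `word` with `width` tokens of context on both sides."""
    word = word.lower()
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == word:  # case-insensitive match
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows

tokens = "a pregnant woman shows up in cedar cove . a young Woman moves away".split()
for left, hit, right in kwic(tokens, "woman"):
    print(f"{left:>20} {hit} {right}")
```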

`similar()` (distributional similarity): finds other words that appear in the same contexts as the given word, listing the most similar words first.

In [8]:
# similar prints the words that appear in contexts similar to those of the argument
text.similar('woman')

man widow detective killer series war family case doctor friend boy
target yacht sheriff murder accident dog nanny group secret
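
The notion of "same context" can be sketched in a few lines. The version below is a simplified illustration, not NLTK's implementation: it treats the pair (previous token, next token) as a word's context and ranks other words by how many such contexts they share with the target. The function name and the toy sentence are assumptions made for the example.

```python
from collections import Counter, defaultdict

def similar_words(tokens, word, n=5):
    """Rank other words by how many (previous, next) contexts they share with `word`."""
    contexts = defaultdict(set)
    for prev, cur, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[cur].add((prev, nxt))
    target = contexts.get(word, set())
    scores = Counter()
    for other, ctxs in contexts.items():
        if other != word:
            shared = len(ctxs & target)
            if shared:
                scores[other] = shared
    return [w for w, _ in scores.most_common(n)]

tokens = ("a woman who sings . a man who runs . a woman in red . "
          "a man in blue . a dog in red").split()
print(similar_words(tokens, "woman"))
```

In this toy corpus, "man" shares two contexts with "woman" and "dog" shares one, so they are returned in that order. On a small dataset such as ours, a single shared context is enough for a word to show up, which helps explain some of the less intuitive entries in the lists.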


In [9]:

text.similar('women')

murder war world love crimes the killer battle resent administration
country president eve life recovery night state summer eyes cop


In [10]:
text.concordance('man', lines=20)

Displaying 20 of 36 matches:
. a woman in her late 30s marries the man of her dreams and reaches out to his 
r hides his male lover . a former hit man for the mob who has become a doctor i
y the betrayals of her mother and the man she loves . intrigue on the planet sa
85 . two agents are tracking the same man , a human trafficker who is now deali
lorida for a missing girl and the con man who seduced her . a runaway girl and 
deployed to iraq . a distraught young man discovers that he has grown horns . a
mpire of charis fights to survive . a man who kidnapped a 15-year-old girl cont
i.a . stand in his way . when a young man finds a bag of diamonds , he gets the
stigator maisie dobbs helps an indian man whose sister ’ s murder has been igno
ttacks . a woman , her daughter and a man accused of murder evade the authoriti
ooper becomes involved when a wealthy man assaults a maid in a manhattan hotel 
as christmas nears , a terminally ill man is preparing his family for his death
 air force 

In [11]:
text.similar('man')

woman killer widow murder war mystery case vampire disappearance boy
target murderer yacht shooting priest detective the women baby series


In [12]:
text.similar('men')

book french president room novel west governor truth culprit males


`common_contexts()`: finds the contexts in which all of the specified words can appear. The underlying `ContextIndex.common_contexts` (linked below) returns a frequency distribution that maps each context to the number of times it was used; `Text.common_contexts` prints the shared contexts.

[Reference](https://tedboy.github.io/nlps/generated/generated/nltk.ContextIndex.common_contexts.html?highlight=common%20_contexts#nltk.ContextIndex.common_contexts)

In [13]:
# common_contexts prints the contexts shared by two or more words
text.common_contexts(['woman', 'man'])

a_in a_with a_who young_s
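
In the output above, each entry has the form `left_right`: `a_in` means that both words occur between "a" and "in". The following sketch shows how such a list could be produced in plain Python. It is a simplification under the assumption that a context is just the pair of neighbouring tokens, and the function shown here is a hypothetical stand-in rather than NLTK's code.

```python
def common_contexts(tokens, words):
    """Return the (previous, next) contexts in which every word in `words` occurs."""
    per_word = {w: set() for w in words}
    for prev, cur, nxt in zip(tokens, tokens[1:], tokens[2:]):
        if cur in per_word:
            per_word[cur].add((prev, nxt))
    shared = set.intersection(*per_word.values())
    return sorted(f"{prev}_{nxt}" for prev, nxt in shared)

tokens = "a woman who sings . a man who runs . a woman in red . a man with keys".split()
print(common_contexts(tokens, ["woman", "man"]))
```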


In [14]:
# Total number of tokens (words and punctuation marks)
len(text)

14545

In [15]:
# Number of distinct tokens
len(set(text))

3226

In [16]:

# Quantifying lexical richness: distinct tokens divided by total tokens
len(set(text)) / len(text)

0.22179443107597113
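
Because the token list in this example still contains punctuation marks, the ratio above mixes them in with the words. A small sketch of the same measure as a reusable function, with a variant that counts alphabetic tokens only; both helper names and the sample tokens are illustrative assumptions:

```python
def lexical_diversity(tokens):
    """Ratio of distinct tokens to total tokens; 0.0 for an empty sequence."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def lexical_diversity_words_only(tokens):
    """Same ratio, but counting only alphabetic tokens (punctuation is ignored)."""
    return lexical_diversity([t for t in tokens if t.isalpha()])

tokens = "a woman , a man , a dog , a cat .".split()
print(lexical_diversity(tokens))
print(lexical_diversity_words_only(tokens))
```

The ratio depends heavily on text length, so it is most meaningful when comparing texts of similar size.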

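`collocations()`: prints the collocations found in the text, ignoring _stopwords_. This relies on NLTK's English stopword list, which may need to be fetched first with `nltk.download('stopwords')`.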

In [17]:
# Collocations are word pairs (bigrams) that are unusually common in our dataset
text.collocations()

new york; serial killer; stone barrington; los angeles; writing
pseudonymously; nora roberts; eve dallas; lt. eve; sookie stackhouse;
anita blake; north carolina; eve duncan; dagger brotherhood; doc ford;
jason bourne; lacey sherlock; mitch rapp; temperance brennan; forensic
sculptor; alex cross
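
As a rough illustration of the idea, the sketch below counts adjacent word pairs while skipping stopwords and punctuation. It is deliberately simplified: NLTK scores candidate bigrams with a statistical association measure rather than raw counts, so its ranking will differ. The `STOPWORDS` set, the `frequent_bigrams` helper, and the sample sentence are all assumptions made for the example.

```python
from collections import Counter

# Tiny illustrative stopword list; NLTK uses its full English list instead
STOPWORDS = {"a", "an", "the", "in", "of", "and", "is", "on"}

def frequent_bigrams(tokens, n=3, min_count=2):
    """Return the most frequent adjacent pairs of non-stopword, alphabetic tokens."""
    pairs = Counter(
        (w1, w2)
        for w1, w2 in zip(tokens, tokens[1:])
        if w1.isalpha() and w2.isalpha()
        and w1 not in STOPWORDS and w2 not in STOPWORDS
    )
    return [" ".join(pair) for pair, count in pairs.most_common(n) if count >= min_count]

tokens = ("a serial killer in new york . the serial killer is back . "
          "a detective in new york . a serial killer strikes").split()
print("; ".join(frequent_bigrams(tokens)))
```

Requiring a minimum count keeps one-off pairs such as "killer strikes" out of the result.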
