# NLP

NLP (natural language processing) covers applications that understand human language: speech recognition, translation, semantic understanding, sentiment analysis, and so on.

**Uses**

+ Search engines
+ Social media feeds
+ Voice assistants
+ Spam filters
+ Chatbots

**Libraries**

+ NLTK
+ spaCy
+ scikit-learn (TF-IDF vectorizer)
+ OpenNLP

The difficulty of NLP arises at several levels:

+ Ambiguity:

  * Lexical level: for example, words with several meanings
  * Referential level: anaphora, metaphors, etc.
  * Structural level: semantics is needed to understand the structure of a sentence
  * Pragmatic level: double meanings, irony, humor

+ Word-boundary (space) detection
+ Imperfect input: accents, "-isms" (such as regionalisms), OCR errors

The process is similar to the one used in unsupervised learning (USL): first the words are vectorized, then the distances/similarities between the resulting vectors are measured.
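
As a minimal sketch of that two-step idea (not the pipeline used in the cells that follow), the snippet below uses raw word counts as vectors and cosine similarity as the measure. The helper names `bag_of_words` and `cosine_similarity` and the toy sentences are assumptions made for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Vectorize a text as a word -> count mapping (hypothetical helper)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# toy documents, invented for this example
docs = [
    "a crime family in new york",
    "a new york crime story",
    "a farm girl travels to a magical land",
]
vectors = [bag_of_words(d) for d in docs]

sim_close = cosine_similarity(vectors[0], vectors[1])  # many shared words
sim_far = cosine_similarity(vectors[0], vectors[2])    # only "a" is shared
print(sim_close, sim_far)
```

The first pair shares most of its vocabulary and scores higher than the second pair, which is the property the rest of the notebook relies on, with TF-IDF weights in place of raw counts.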

In [1]:
# list of 100 movies

titles=open('data/title_list.txt').read().split('\n')[:100]

titles[:15]

['The Godfather',
 'The Shawshank Redemption',
 "Schindler's List",
 'Raging Bull',
 'Casablanca',
 "One Flew Over the Cuckoo's Nest",
 'Gone with the Wind',
 'Citizen Kane',
 'The Wizard of Oz',
 'Titanic',
 'Lawrence of Arabia',
 'The Godfather: Part II',
 'Psycho',
 'Sunset Blvd.',
 'Vertigo']

In [2]:
synopsis=open('data/synopses_list.txt').read().split('\n BREAKS HERE')[:100]

synopsis[0][:200]

" Plot  [edit]  [  [  edit  edit  ]  ]  \n  On the day of his only daughter's wedding, Vito Corleone hears requests in his role as the Godfather, the Don of a New York crime family. Vito's youngest son,"

### Cleaning

In [3]:
#!pip3 install spacy

In [4]:
import string
import spacy

from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

import re

In [5]:
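# note: the 'en' shortcut only works in spaCy 2.x; spaCy 3+ needs the full
# model name, e.g. spacy.load('en_core_web_sm')
nlp=spacy.load('en')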

parser=English()

In [6]:
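def spacy_tokenizer(sentence):
    """Tokenize a text and keep lowercase, alphabetic, non-stop-word lemmas."""
    tokens=parser(sentence)
    
    filtered_tokens=[]
    for word in tokens:
        # normalize: lemma as given by the pipeline, lowercased and stripped
        lemma=word.lemma_.lower().strip()
        
        # drop stop words and anything that is not purely alphabetic
        if lemma not in STOP_WORDS and re.search(r'^[a-zA-Z]+$', lemma):
            filtered_tokens.append(lemma)
            
    return filtered_tokens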

In [7]:
spacy_tokenizer(synopsis[0][:200])

['plot',
 'edit',
 'edit',
 'edit',
 'day',
 'daughter',
 'wedding',
 'vito',
 'corleone',
 'hears',
 'requests',
 'role',
 'godfather',
 'don',
 'new',
 'york',
 'crime',
 'family',
 'vito',
 'youngest',
 'son']

### TF-IDF (term frequency-inverse document frequency)
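
As a rough sketch of what this weighting computes: each term's count in a document is multiplied by an inverse document frequency, so words that appear in most documents receive low weights. The version below uses the smoothed form idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which is the default behaviour documented for scikit-learn's `TfidfVectorizer`. The function name `tfidf_rows` and the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs):
    """Return (vocabulary, rows) with smoothed, L2-normalized TF-IDF weights."""
    n_docs = len(tokenized_docs)
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    # document frequency: number of documents that contain each term
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row))
        rows.append([w / norm if norm else 0.0 for w in row])
    return vocab, rows


# toy corpus, already tokenized (invented for this example)
corpus = [
    ["crime", "family", "son"],
    ["crime", "prison", "escape"],
    ["crime", "war", "family"],
]
vocab, rows = tfidf_rows(corpus)
weights = dict(zip(vocab, rows[1]))  # weights of the second document
print(weights)
```

In the second document, "crime" appears in every document of the corpus and therefore ends up with a lower weight than "prison" or "escape", which are specific to that document.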

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [9]:
tfidf=TfidfVectorizer(min_df=.15, tokenizer=spacy_tokenizer)

In [10]:
tfidf_matrix=tfidf.fit_transform(synopsis)

In [11]:
tfidf_matrix.shape

(100, 254)
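
The 254 columns are the terms that survive the `min_df=.15` filter: when `min_df` is a float, scikit-learn treats it as a proportion, so a term is kept only if it appears in at least 15% of the documents. A minimal sketch of that filtering idea follows, with a hypothetical helper `filter_by_min_df` and toy data; the exact threshold handling inside scikit-learn is not reproduced here.

```python
from collections import Counter


def filter_by_min_df(tokenized_docs, min_df):
    """Keep terms whose document proportion is at least min_df (a float in [0, 1])."""
    n_docs = len(tokenized_docs)
    # count each term once per document, however often it occurs there
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return sorted(t for t, count in df.items() if count / n_docs >= min_df)


# toy tokenized documents, invented for this example
docs = [
    ["war", "army", "battle"],
    ["war", "love"],
    ["army", "war", "war"],
    ["love", "family"],
]
kept = filter_by_min_df(docs, min_df=0.5)
print(kept)
```

Raising `min_df` shrinks the vocabulary to words shared across many synopses, which reduces noise from rare names at the cost of discarding some distinctive terms.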

In [12]:
tfidf_matrix

<100x254 sparse matrix of type '<class 'numpy.float64'>'
	with 6489 stored elements in Compressed Sparse Row format>

In [13]:
import pandas as pd

# convert the sparse matrix to a dense array first; passing the sparse matrix
# directly yields a single column holding the string form of each row
pd.DataFrame(tfidf_matrix.toarray()).head()


In [14]:
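# note: in scikit-learn >= 1.0 use tfidf.get_feature_names_out();
# get_feature_names() was removed in version 1.2
terms=tfidf.get_feature_names()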
terms[:15]

['able',
 'agrees',
 'air',
 'american',
 'apartment',
 'army',
 'arrive',
 'arrives',
 'asks',
 'attack',
 'attempt',
 'attempts',
 'attention',
 'away',
 'battle']

### Distances

In [15]:
from sklearn.metrics.pairwise import cosine_similarity as cos

In [16]:
dist=1-cos(tfidf_matrix)

dist.shape

(100, 100)
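
Since `dist` holds 1 minus the cosine similarity, small values mean similar synopses and the diagonal is (approximately) zero. One way such a matrix can be used is to look up each item's nearest neighbour. The sketch below does this on a hand-written toy matrix with a hypothetical helper `nearest_neighbour`; it is not part of the original notebook.

```python
def nearest_neighbour(dist_matrix, i):
    """Index of the row closest to row i, ignoring i itself."""
    candidates = [j for j in range(len(dist_matrix)) if j != i]
    return min(candidates, key=lambda j: dist_matrix[i][j])


# toy symmetric distance matrix, invented for this example
toy_titles = ["A", "B", "C"]
toy_dist = [
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.7],
    [0.9, 0.7, 0.0],
]
closest = [toy_titles[nearest_neighbour(toy_dist, i)] for i in range(len(toy_titles))]
print(closest)
```

With the real data, the same lookup on `dist` and `titles` would give, for each movie, the title whose synopsis is closest under the TF-IDF representation.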

### Clustering

In [17]:
import warnings
warnings.simplefilter('ignore')

import matplotlib.pyplot as plt
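# note: deprecated since IPython 7.23; newer versions provide this function in
# matplotlib_inline.backend_inline
from IPython.display import set_matplotlib_formats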
%matplotlib inline
set_matplotlib_formats('svg')

import numpy as np

In [18]:
from umap import UMAP