# Sentiment Analysis with KNN and PCA

We will use the IMDB dataset from [Maas et al. (2011)](https://ai.stanford.edu/~amaas/data/sentiment/).


In [1]:
!wget https://github.com/finiteautomata/imdb-dataset/raw/master/imdb_dataset.csv.zip
!unzip imdb_dataset.csv.zip

--2020-05-29 22:37:15--  https://github.com/finiteautomata/imdb-dataset/raw/master/imdb_dataset.csv.zip
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/finiteautomata/imdb-dataset/master/imdb_dataset.csv.zip [following]
--2020-05-29 22:37:15--  https://raw.githubusercontent.com/finiteautomata/imdb-dataset/master/imdb_dataset.csv.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26962657 (26M) [application/zip]
Saving to: ‘imdb_dataset.csv.zip’


2020-05-29 22:37:16 (42.4 MB/s) - ‘imdb_dataset.csv.zip’ saved [26962657/26962657]

Archive:  imdb_dataset.csv.zip
  inflating: IMDB Dataset.csv        


In [2]:
import pandas as pd 

df = pd.read_csv("IMDB Dataset.csv")


print("Number of documents: {}".format(df.shape[0]))

Number of documents: 50000


In [3]:
pd.options.display.max_colwidth = 200

df[:3]

Unnamed: 0,review,sentiment
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me...",positive
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire p...",positive
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue i...",positive


We shuffle the rows so that the data is no longer in its original order.

In [66]:
# sample(frac=1) draws a random sample containing every row of df, i.e. a shuffled copy
df = df.sample(frac=1)

df[:10]

Unnamed: 0,review,sentiment
17393,"This 1977 cult movie has two crazed lesbians (Sandra Locke & Colleen Camp) appearing at the home of wealthy socialite Doctor George Manning (Seymour Cassel), in hope of help in locating a residenc...",negative
10900,"The Reader is an exceptionally well done and very sweet short. Every element of the piece assists in eliciting a pure emotional response to the script. Well acted, directed, shot and written. I wa...",positive
32259,"*** Spoilers*<br /><br />My dad had taped this movie for me when I was 3. By age 5, I had watched it over 400 times. I just watched it and watched it. And I still do today! It has a grim storyline...",positive
47945,"Inane, awful farce basically about a young man who refuses to conform or better uses non-conformity to attain his objectives-fool his parents into thinking that he is attending a college. Truth is...",negative
27618,"I really liked this movie, it totally reminds me of my high school days. The soundtrack is awesome. I am a huge nic cage fan and this is my favorite movie that he is in. I love the storyline, it i...",positive
35974,"my friend and i rented this one a few nights ago. and, i must say, this is the single best movie i have ever seen. i mean, woah! ""dude, we better get some brew before this joint closes"" and ""dude,...",negative
32834,"if filming is about vision and real life this movie is quite perfect: NUOVOMONDO talks about immigration in the USA from Italy the beginning of '900, but it speak also for now, when emigration/imm...",positive
6132,"I couldn't wait to see this movie. About half way through the movie, I couldn't wait for it to end. All of the (white) actors were delivering their lines like Woody Allen had just said, ""Say it li...",negative
2966,"Why?!! This was an insipid, uninspired and embarrassing film. The embarrassment comes from being from the city where they made it...Pittsburgh PA! Why did they let these people do such a BAAAAAD m...",negative
20460,"I have heard a lot about this film, with people writing me telling me I should see it, as I am a fan of extremely bloody, gory movies. I got my hands on it almost right away, but one thing or anot...",negative


## Train and Test

We keep one fraction of the data for training and another for testing.

In [67]:
import sklearn

df_train = df[:10000]
df_test = df[10000:13000]

text_train, text_test = df_train["review"], df_test["review"]
label_train, label_test = df_train["sentiment"], df_test["sentiment"]

print("Class balance : {} pos {} neg".format(
    (label_train == 'positive').sum() / label_train.shape[0], 
    (label_train == 'negative').sum() / label_train.shape[0]
))

Class balance : 0.4971 pos 0.5029 neg


The classes are roughly balanced, so we use accuracy (number of correct predictions / number of predictions) as the metric.
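As a quick illustration of the metric, accuracy can be computed by hand as the fraction of predictions that match the true labels. The labels below are invented for this example and are not taken from the dataset.

```python
# Toy labels, invented for illustration only
y_true = ["positive", "negative", "positive", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "negative", "positive"]

# Accuracy = number of correct predictions / number of predictions
hits = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = hits / len(y_true)

print(hits, len(y_true), accuracy)
```

With balanced classes, a classifier that always predicts the same label would score about 0.5, which is the baseline to beat.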

## Converting to bag of words

Let's see how `CountVectorizer` works.

The general idea is that `CountVectorizer` converts a collection of texts into the bag-of-words model, in which each text is represented as a vector in $\mathbb{R}^{|V|}$, where $V$ is the chosen vocabulary and each component counts the occurrences of one vocabulary word.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

textos = [
    "bolsa de palabras",
    "bolsa es una palabra",
    "palabra no es una bolsa",
]

vect = CountVectorizer()

We fit it on these texts.

In [7]:
vect.fit(textos)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [8]:
vect.vocabulary_

{'bolsa': 0, 'de': 1, 'es': 2, 'no': 3, 'palabra': 4, 'palabras': 5, 'una': 6}

In [9]:
mat = vect.transform(textos)

mat

<3x7 sparse matrix of type '<class 'numpy.int64'>'
	with 12 stored elements in Compressed Sparse Row format>

This is a sparse matrix (in Spanish, "rala"; "esparsa" is not a Spanish word).

In [10]:
mat = mat.todense()

mat

matrix([[1, 1, 0, 0, 0, 1, 0],
        [1, 0, 1, 0, 1, 0, 1],
        [1, 0, 1, 1, 1, 0, 1]])
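To make the counts above concrete, here is a minimal pure-Python sketch that rebuilds the same vocabulary and matrix by hand. It reuses the token pattern shown in the `CountVectorizer` output (words of two or more characters, lowercased); the variable names `tokenized`, `vocabulary` and `counts` are ours, and this is only an illustration, not how scikit-learn implements it.

```python
import re

textos = [
    "bolsa de palabras",
    "bolsa es una palabra",
    "palabra no es una bolsa",
]

# Same token pattern as in the CountVectorizer repr: words of 2+ characters
token_pattern = re.compile(r"(?u)\b\w\w+\b")
tokenized = [token_pattern.findall(t.lower()) for t in textos]

# Vocabulary: every distinct token, sorted, mapped to a column index
all_words = sorted({w for doc in tokenized for w in doc})
vocabulary = {w: i for i, w in enumerate(all_words)}

# One row per text, one column per vocabulary word; entries are counts
counts = [[0] * len(vocabulary) for _ in textos]
for row, doc in zip(counts, tokenized):
    for w in doc:
        row[vocabulary[w]] += 1

print(vocabulary)
for row in counts:
    print(row)
```

Word order is discarded: the second and third texts differ only in the column for `no`.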

Now let's apply this to our reviews.

We will not keep every word:

- Remove very frequent words
- Remove words that appear very few times

Why does this help?
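One possible answer, offered here as our reading rather than the notebook's: a word that appears in almost every text does little to tell texts apart, a word that appears in only one or two texts gives the classifier almost nothing to generalize from, and dropping both shrinks the vectors. The sketch below filters the toy texts by document frequency; the thresholds are hypothetical, picked only so that something is removed at each end. `CountVectorizer` exposes the same idea through its `min_df` and `max_df` parameters.

```python
from collections import Counter

textos = [
    "bolsa de palabras",
    "bolsa es una palabra",
    "palabra no es una bolsa",
]

# Document frequency: in how many texts each word appears
doc_freq = Counter(w for t in textos for w in set(t.split()))

# Hypothetical thresholds, chosen only for this toy example
min_df, max_df = 2, 2

# Keep words that are neither too rare nor too common
kept = sorted(w for w, n in doc_freq.items() if min_df <= n <= max_df)

print(dict(doc_freq))
print(kept)
```

Here `bolsa` is dropped for appearing in every text, and `de`, `no` and `palabras` for appearing in only one.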

In [11]:
vect = CountVectorizer()

vect.fit(text_train)

len(vect.vocabulary_)

52296

That is a lot. Let's reduce it a bit.

In [12]:
vect = CountVectorizer(min_df=3, max_features=5000)

vect.fit(text_train)

len(vect.vocabulary_)

5000

In [None]:
X_train = vect.transform(text_train)
X_test = vect.transform(text_test)

y_train = label_train  # == 'positive' would convert the labels to boolean vectors
y_test = label_test  # == "positive"

In [14]:
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=50)

clf.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=50, p=2,
                     weights='uniform')

In [15]:
%%time
from sklearn.metrics import accuracy_score
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print("Accuracy: {}".format(acc))

Accuracy: 0.6663333333333333
CPU times: user 3.7 s, sys: 362 ms, total: 4.07 s
Wall time: 4.09 s
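To make the classifier above less of a black box, here is a minimal sketch of what a k-nearest-neighbours prediction does with the settings shown in its output (Euclidean distance, uniform weights): find the k closest training points and take a majority vote. The function `knn_predict` and the toy points are hypothetical, written for this example; scikit-learn's implementation is more elaborate (it can use tree-based indexes, for instance).

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training point
    dists = [math.dist(x, xt) for xt in X_train]
    # Indices of the k smallest distances
    nearest = sorted(range(len(X_train)), key=lambda i: dists[i])[:k]
    # Majority vote among the neighbours' labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D points, invented for illustration
X_toy = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y_toy = ["negative", "negative", "negative", "positive", "positive", "positive"]

print(knn_predict(X_toy, y_toy, (0.5, 0.5)))
print(knn_predict(X_toy, y_toy, (5.5, 5.5)))
```

Every prediction compares the query against all training points, which is why prediction, not fitting, is the slow step, and why distances in a 5000-dimensional count space may be less informative than in a lower-dimensional one.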


Can we improve on this?

## Power method

Implement the following functions (`power_iteration` and `eig`).
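For reference, a sketch of the iteration these functions are expected to carry out, assuming $A$ has a single dominant eigenvalue $\lambda_1$ (largest in absolute value) and that the starting vector $v_0$ is not orthogonal to its eigenvector. The symbols $v_k$, $\lambda_i$ and $A'$ are our notation, not the notebook's.

$$
v_{k+1} = \frac{A v_k}{\lVert A v_k \rVert_2},
\qquad
\lambda_1 \approx \frac{v_k^\top A v_k}{v_k^\top v_k}
$$

Once $(\lambda_1, v_1)$ has been found with $\lVert v_1 \rVert_2 = 1$, and assuming $A$ is symmetric, deflation removes that component so the same iteration converges to the next eigenpair:

$$
A' = A - \lambda_1 v_1 v_1^\top
$$

For symmetric $A$ the eigenvectors are orthogonal, so $A'$ keeps every other eigenpair of $A$ and replaces $\lambda_1$ with $0$.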

In [None]:
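import numpy as np

def power_iteration(A, niter=10000, eps=1e-6):
    """
    Estimate the dominant eigenvalue of A and an associated eigenvector
    using the power method.
    """
    v_new = np.random.rand(A.shape[0])
    v = np.zeros(A.shape[0])

    i = 0
    # Iterate until the vector stops changing or niter is reached.
    # Note: if the dominant eigenvalue is negative, v flips sign on every
    # step, so this test never triggers and the loop runs all niter iterations.
    while i < niter and not np.allclose(v_new, v, atol=eps):
        v = v_new
        v_new = A @ v_new
        v_new = v_new / np.linalg.norm(v_new)
        i += 1

    # Rayleigh quotient: v^T A v / v^T v
    a = (v @ A @ v) / (v @ v)

    return a, v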


In [63]:

D = np.diag([5.0, 4.0, 3.0, 2.0, 1.0])

v = np.ones((D.shape[0], 1))

v = v / np.linalg.norm(v)

# Householder matrix
B = np.eye(D.shape[0]) - 2 * (v @ v.T)

# M is diagonalized by construction: its eigenvalues are the diagonal of D
M = B.T @ D @ B

power_iteration(M)

(4.999999998297475,
 array([-0.59998349,  0.39997524,  0.4000165 ,  0.40001651,  0.40001651]))

In [None]:
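def eig(A, num=2, niter=10000, eps=1e-6):
    """
    Compute num eigenvalues and eigenvectors using the power method plus deflation.

    Returns the eigenvalues and an array of shape (num, n) with one eigenvector per row.
    The deflation step assumes A is symmetric; otherwise only the first
    eigenvector is guaranteed to be an eigenvector of A.
    """
    A = A.copy()
    eigenvalues = []
    # One row per eigenvector: row i has length n = A.shape[0]
    eigenvectors = np.zeros((num, A.shape[0]))
    for i in range(num):
        eigenvalue, eigenvector = power_iteration(A, niter, eps)
        eigenvalues.append(eigenvalue)
        eigenvectors[i, :] = eigenvector
        # Deflation: remove the component along the eigenvector just found
        A = A - eigenvalue * np.outer(eigenvector, eigenvector)

    return np.array(eigenvalues), eigenvectors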


In [65]:
A = np.array([
              [0, 1],
              [-2, -3]
])

vals, vectors = eig(A)
print(vals)
print(vectors)

[-2. -1.]
[[ 0.4472136  -0.89442719]
 [ 0.14142136 -0.98994949]]
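The matrix `A` above is not symmetric, so subtracting $\lambda v v^\top$ preserves the remaining eigenvalue but not its eigenvector: the second row printed is not an eigenvector of `A`. For the symmetric matrices that PCA works with (covariance matrices) the deflation is sound. The sketch below checks this on a small symmetric matrix against `np.linalg.eigh`. The function `top_eigenpairs` and the matrix `S` are hypothetical, written only for this check, and the function assumes distinct non-negative eigenvalues.

```python
import numpy as np

def top_eigenpairs(S, num, niter=10000, eps=1e-10):
    """Power method plus deflation; assumes S is symmetric positive semidefinite."""
    S = np.array(S, dtype=float)  # work on a float copy
    rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
    vals, vecs = [], []
    for _ in range(num):
        v = rng.random(S.shape[0])
        v /= np.linalg.norm(v)
        for _ in range(niter):
            w = S @ v
            w /= np.linalg.norm(w)
            done = np.linalg.norm(w - v) < eps
            v = w
            if done:
                break
        lam = v @ S @ v  # Rayleigh quotient; v has unit norm
        vals.append(lam)
        vecs.append(v)
        S = S - lam * np.outer(v, v)  # deflate the component just found
    return np.array(vals), np.array(vecs)  # one eigenvector per row

# Toy symmetric matrix, invented for illustration
S = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

vals, vecs = top_eigenpairs(S, num=2)
ref_vals = np.linalg.eigh(S)[0][::-1][:2]  # eigh returns ascending order

print(np.allclose(vals, ref_vals, atol=1e-6))
print(np.allclose(S @ vecs[0], vals[0] * vecs[0], atol=1e-6))
print(np.allclose(S @ vecs[1], vals[1] * vecs[1], atol=1e-6))
```

Both returned vectors satisfy $S v = \lambda v$ for the original `S`, which is the property that fails for the second vector of the non-symmetric example.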
