# NLP and Classification with Machine Learning


Movie reviews from the public *txt_sentoken* dataset on Kaggle, used for natural language processing.

Classification with several Machine Learning algorithms applied to sentiment analysis. 




In [1]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [2]:
from sklearn.datasets import load_files
import numpy as np

In [3]:
files = load_files("drive/My Drive/Mineria_Datos/txt_sentoken")


### Text preprocessing

Functions to remove punctuation marks and unusual characters.
They also clean up contraction remnants (such as the dangling `s` from `it's`) and remove stopwords. 
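As a rough illustration of these steps, the sketch below uses only the standard library. The tiny `STOPWORDS` set and the `clean_tokens` name are assumptions made for this example; the notebook itself tokenizes with NLTK and uses NLTK's English stopword list.

```python
import re

# Hypothetical, deliberately tiny stopword set (the notebook uses NLTK's English list).
STOPWORDS = {"a", "an", "the", "it", "is", "as", "and", "but", "of"}

def clean_tokens(text):
    """Return lowercase tokens without punctuation, dangling 's' or stopwords."""
    # findall keeps only runs of letters/digits, so punctuation disappears
    # and "it's" splits into "it" and "s".
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t != "s" and t not in STOPWORDS]

print(clean_tokens("It's hard seeing Arnold as Mr. Freeze , but hey !"))
```

A regex tokenizer like this is cruder than `word_tokenize`, but it makes the effect of each cleaning rule easy to see.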

In [4]:
X = files.data
Y = files.target


In [5]:
import nltk

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [6]:
import re

def remove_punctuation(words):
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)  
        if new_word != '':
            new_words.append(new_word)
    return new_words


In [7]:
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')


def remove_stopwords(words):
    # Build the stopword set once; reloading the list for every word is very slow
    stop_set = set(stopwords.words('english'))
    return [word for word in words if word not in stop_set]


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
def contar_letras_sentencias_token(token):
  # total number of characters across all tokens
  return sum(len(t) for t in token)



For the first 10 reviews, we reduce the size of the word vectors and measure the difference in size between the original text and the processed text.



In [9]:
from nltk.tokenize import word_tokenize


for i in range(0,10):
  res = str(X[i])
  print("Texto original -> ",res) 
  print("Cantidad de caracteres inicial: ",len(res), " \n\n")

  palabras = word_tokenize(res[2:]) # skip the leading b" prefix that str() adds for a bytes object
  
  print("Texto tokenizado en palabras -> ",palabras, "\n\n")
 
  
  texto_limpio = []
  
  for p in palabras:  # for each token, strip a leading literal "\n" (its first two characters)
    
    if (p[:2] == "\\n"): # str() on a bytes object yields its repr, so a newline shows up as backslash + n
      p = p[2:]
    
    p = re.sub(r'[^\w\s]', '', p)  # remove punctuation marks

    if (p == "s"): # drop the dangling s left over from it's or 80's
      p = ""

    if (p != ""): # keep only non-empty tokens in the cleaned text
      texto_limpio.append(p)


  print("Texto limpio -> ",texto_limpio)
  print("Cantidad de caracteres texto limpio: ",contar_letras_sentencias_token(texto_limpio), "\n\n")
  
  sin_stop = remove_stopwords(texto_limpio)
  print("Texto limpio y sin stopwords -> ",sin_stop)
  print("Cantidad de caracteres final sin stopwords: ", contar_letras_sentencias_token(sin_stop),"\n\n")
  print("Texto original reducido en un -> ", 1 - contar_letras_sentencias_token(sin_stop)/len(res))  # fraction of characters removed
  print("\n----------------------------------------------------------------------------------------\n")

Texto original ->  b"arnold schwarzenegger has been an icon for action enthusiasts , since the late 80's , but lately his films have been very sloppy and the one-liners are getting worse . \nit's hard seeing arnold as mr . freeze in batman and robin , especially when he says tons of ice jokes , but hey he got 15 million , what's it matter to him ? \nonce again arnold has signed to do another expensive blockbuster , that can't compare with the likes of the terminator series , true lies and even eraser . \nin this so called dark thriller , the devil ( gabriel byrne ) has come upon earth , to impregnate a woman ( robin tunney ) which happens every 1000 years , and basically destroy the world , but apparently god has chosen one man , and that one man is jericho cane ( arnold himself ) . \nwith the help of a trusty sidekick ( kevin pollack ) , they will stop at nothing to let the devil take over the world ! \nparts of this are actually so absurd , that they would fit right in with dogma . \

##### On average, the text shrinks by up to about 50% and remains understandable.
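The reduction can be expressed as the fraction of the original characters that were removed. The sketch below is one way to compute it; `reduction_fraction` and the sample tokens are hypothetical names and data for this example. Note that the measure also counts the whitespace that disappears when the text becomes a token list.

```python
def reduction_fraction(original_text, kept_tokens):
    """Fraction of the original characters removed by the cleaning steps."""
    if not original_text:
        return 0.0
    kept_chars = sum(len(tok) for tok in kept_tokens)
    return 1.0 - kept_chars / len(original_text)

original = "the film is not bad , but it is far too long ."
kept = ["film", "bad", "far", "long"]  # hypothetical output of the cleaning step
print(round(reduction_fraction(original, kept), 2))
```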



In [10]:
# custom preprocessor handed to the vectorizer inside the pipeline

def preprocesador(texto):
  # Remove punctuation: everything that is neither a word character nor whitespace.
  # After this step only word characters and whitespace remain.
  texto = re.sub(r'[^\w\s]', '', texto)
  return texto

### Split the dataset into 80% training and 20% test

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(files.data, files.target, test_size=0.2)

print("Cantidad de datos de entrenamiento: ",len(X_train))
print("Cantidad de datos de prueba: ",len(X_test))

Cantidad de datos de entrenamiento:  1600
Cantidad de datos de prueba:  400
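Because the split above is random, the scores change slightly from run to run. A fixed seed makes it repeatable; `train_test_split` accepts a `random_state` argument for this purpose. The standalone sketch below shows the same idea with the standard library only; `seeded_split` and its default seed are assumptions for the example.

```python
import random

def seeded_split(items, labels, test_size=0.2, seed=42):
    """Shuffle indices with a fixed seed and cut them into train/test parts."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)      # same seed -> same permutation
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

docs = ["review %d" % i for i in range(2000)]   # placeholder documents
targets = [i % 2 for i in range(2000)]          # placeholder labels
X_tr, X_te, y_tr, y_te = seeded_split(docs, targets)
print(len(X_tr), len(X_te))
```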


# Classifiers

Each classifier is first scored by its mean accuracy on the test split and then tuned with GridSearch.

GridSearch selects suitable hyperparameters for each classifier.

Finally, k-fold cross validation is used to compare the results.
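The score computed in the cells that follow is the fraction of correct predictions, that is, accuracy. This differs from precision in the information-retrieval sense, which only looks at the examples predicted as positive. The small sketch below contrasts the two on made-up labels; the function names and data are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred, positive=1):
    """Fraction of predicted positives that are truly positive."""
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not predicted_pos:
        return 0.0
    return sum(t == positive for t in predicted_pos) / len(predicted_pos)

y_true = [1, 1, 1, 0, 0, 0, 0, 0]  # made-up ground truth
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]  # made-up predictions
print(accuracy(y_true, y_pred), precision(y_true, y_pred))
```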


In [12]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer


## Support Vector Machines


### SGDClassifier


In [13]:
from sklearn.linear_model import SGDClassifier


clasificador_sgd = Pipeline([('VECTORIZADOR', CountVectorizer(stop_words='english', tokenizer=word_tokenize, preprocessor=preprocesador)), 
                             ('TRANSFORMADOR-tfidf', TfidfTransformer()),
                             ('CLASIFICADOR-SGD', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3))])


clasificador_sgd = clasificador_sgd.fit(X_train, y_train)
predicted_sgd = clasificador_sgd.predict(X_test)

#print ("Clases identificadas : ", predicted_sgd)
#print ("Clases reales del doc: ", y_test)
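To make the middle step of the pipeline more concrete, here is a hand-rolled sketch of the weighting that `TfidfTransformer` applies with its default settings: raw term counts, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The `tfidf` function and the toy documents are assumptions for the example, not part of the notebook.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

toy_docs = [["great", "film"], ["boring", "film"], ["great", "great", "cast"]]
for row in tfidf(toy_docs):
    print({t: round(w, 3) for t, w in row.items()})
```

Terms that appear in fewer documents (such as `boring` here) receive a larger weight than terms shared by several documents.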


In [14]:
resultados = []

for i in range(predicted_sgd.size):
    if (predicted_sgd[i] == y_test[i]):
        resultados.append(1) # correct prediction
    else:
        resultados.append(0) # wrong prediction

In [15]:
np_mean = np.array(resultados)

In [16]:
print("Clasificacion con SGDClassifier(SVM): ", np.mean(np_mean))

Clasificacion con SGDClassifier(SVM):  0.8225


In [17]:
import time

from sklearn.model_selection import GridSearchCV


parameters_sgd = {'VECTORIZADOR__ngram_range': [(1, 1), (1, 2)], 
                  'TRANSFORMADOR-tfidf__use_idf': (True, False),
                  'CLASIFICADOR-SGD__alpha': (1e-2, 1e-3)}

start_time = time.time()

gs_clasificador_sgd = GridSearchCV(clasificador_sgd, parameters_sgd, cv=5, n_jobs=-1) 

#print(gs_clf_svm.get_params())
gs_clasificador_sgd = gs_clasificador_sgd.fit(X_train, y_train)


print("--- %s seconds ---" % (time.time() - start_time))

# best parameters found for SGDClassifier

print(gs_clasificador_sgd.best_score_)
print(gs_clasificador_sgd.best_params_)

--- 215.0866255760193 seconds ---
0.8275
{'CLASIFICADOR-SGD__alpha': 0.001, 'TRANSFORMADOR-tfidf__use_idf': True, 'VECTORIZADOR__ngram_range': (1, 1)}
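Part of the running time comes from the number of model fits the search performs. The sketch below enumerates the same grid by hand to count them; `expand_grid` is a hypothetical helper written for this example and is not how `GridSearchCV` is implemented.

```python
from itertools import product

param_grid = {
    'VECTORIZADOR__ngram_range': [(1, 1), (1, 2)],
    'TRANSFORMADOR-tfidf__use_idf': (True, False),
    'CLASIFICADOR-SGD__alpha': (1e-2, 1e-3),
}

def expand_grid(grid):
    """Return every combination of the grid values as a list of dicts."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

candidates = expand_grid(param_grid)
cv_folds = 5
print(len(candidates), "candidates,", len(candidates) * cv_folds, "fits during the search")
```

With 2 values per parameter this gives 8 candidates and 40 fits, plus one final refit of the best candidate on the whole training set when `refit=True` (the default).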


### Support Vector Classifier (SVC)

In [18]:
from sklearn.svm import SVC

svc = SVC(kernel='linear')
svc

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [19]:
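# Note: SVC() below uses the default RBF kernel; the linear `svc` built in the previous cell is not used here.
clasificador_svc = Pipeline([('VECTORIZADOR', CountVectorizer(stop_words='english', tokenizer=word_tokenize, preprocessor=preprocesador)), 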
                             ('TRANSFORMADOR-tfidf', TfidfTransformer()),
                             ('CLASIFICADOR-SVC', SVC())])

clasificador_svc = clasificador_svc.fit(X_train, y_train)
predicted_svc = clasificador_svc.predict(X_test)

#print ("Clases identificadas : ", predicted_svc) 
#print ("Clases reales del doc: ", y_test)


In [20]:
print("Clasificacion con SVC: ",np.mean(predicted_svc == y_test))

Clasificacion con SVC:  0.81


In [21]:
parameters = {'VECTORIZADOR__ngram_range': [(1, 1), (1, 2)], 
              'TRANSFORMADOR-tfidf__use_idf': (True, False),
              #'CLASIFICADOR-SVC__alpha': (1e-2, 1e-3)} 
              #'CLASIFICADOR-SVC__C': (1e-2, 1e-3)}
              'CLASIFICADOR-SVC__gamma': (1,0)}

In [22]:

start_time = time.time()

gs_clasificador_svc = GridSearchCV(clasificador_svc, parameters, cv=5, n_jobs=-1)


#print(gs_svc_clf.get_params().keys())  # lists every tunable parameter


gs_clasificador_svc = gs_clasificador_svc.fit(X_train, y_train)

print("Precision SVC: ",gs_clasificador_svc.best_score_,"\n")
print("--- %s seconds ---" % (time.time() - start_time))

print(gs_clasificador_svc.best_params_)

Precision SVC:  0.8099999999999999 

--- 425.20132088661194 seconds ---
{'CLASIFICADOR-SVC__gamma': 1, 'TRANSFORMADOR-tfidf__use_idf': True, 'VECTORIZADOR__ngram_range': (1, 1)}


## Naive Bayes


### BernoulliNB

Naive Bayes variant for binary features: each word counts as either present or absent.

In [23]:
from sklearn.naive_bayes import BernoulliNB

clasificador_bn = BernoulliNB()

clasificador_bn


BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
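The `binarize=0.0` shown in the repr above means that every feature value greater than 0 is treated as 1 before the model is fitted, so the TF-IDF weights only tell BernoulliNB whether a word occurs at all. A minimal sketch of that thresholding, with a made-up feature row:

```python
def binarize(row, threshold=0.0):
    """Map values above the threshold to 1 and everything else to 0."""
    return [1 if value > threshold else 0 for value in row]

tfidf_row = [0.0, 0.12, 0.0, 0.57, 0.81]  # hypothetical TF-IDF weights for one review
print(binarize(tfidf_row))  # [0, 1, 0, 1, 1]
```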

In [24]:
clasificador_bernoulli = Pipeline([('VECTORIZADOR', CountVectorizer(stop_words='english', tokenizer=word_tokenize, preprocessor=preprocesador)), 
                                   ('TRANSFORMADOR-tfidf', TfidfTransformer()),
                                   ('CLASIFICADOR-BERNOULLI', clasificador_bn)])

clasificador_bernoulli = clasificador_bernoulli.fit(X_train, y_train)
predicted_bernoulli = clasificador_bernoulli.predict(X_test)

#print ("Clases identificadas : ", predicted_bernoulli) 
#print ("Clases reales del doc: ", y_test)

print("Clasificacion con Bernoulli: ",np.mean(predicted_bernoulli == y_test))

Clasificacion con Bernoulli:  0.7925


In [25]:
parameters = {'VECTORIZADOR__ngram_range': [(1, 1), (1, 2)], 
              'TRANSFORMADOR-tfidf__use_idf': (True, False), 
              'CLASIFICADOR-BERNOULLI__alpha': (1e-2, 1e-3)}


start_time = time.time()

gs_clasificador_bernoulli = GridSearchCV(clasificador_bernoulli, parameters, cv=5, n_jobs=-1)
gs_clasificador_bernoulli = gs_clasificador_bernoulli.fit(X_train, y_train)

print("--- %s seconds ---" % (time.time() - start_time))
print("Precision bernoulli: ",gs_clasificador_bernoulli.best_score_)

print(gs_clasificador_bernoulli.best_params_)

--- 214.77479934692383 seconds ---
Precision bernoulli:  0.79375
{'CLASIFICADOR-BERNOULLI__alpha': 0.001, 'TRANSFORMADOR-tfidf__use_idf': True, 'VECTORIZADOR__ngram_range': (1, 2)}


### Multinomial

In [26]:
from sklearn.naive_bayes import MultinomialNB


clasificador_multinomial = Pipeline([('VECTORIZADOR', CountVectorizer(stop_words='english', tokenizer=word_tokenize, preprocessor=preprocesador)), 
                                     ('TRANSFORMADOR-tfidf', TfidfTransformer()), 
                                     ('CLASIFICADOR-MULTINOMIAL', MultinomialNB())])


clasificador_multinomial = clasificador_multinomial.fit(X_train, y_train)
predicted_multinomial = clasificador_multinomial.predict(X_test)

#print("Clases identificadas : ", predicted_multinomial) 
#print("Clases reales del doc: ", y_test)

print("Clasificacion con Multinomial: ",np.mean(predicted_multinomial == y_test))

Clasificacion con Multinomial:  0.8125


In [27]:

import time
from sklearn.model_selection import GridSearchCV

parameters = {'VECTORIZADOR__ngram_range': [(1, 1), (1, 2)], 
              'TRANSFORMADOR-tfidf__use_idf': (True, False), 
              'CLASIFICADOR-MULTINOMIAL__alpha': (1e-2, 1e-3)}


start_time = time.time()

gs_clasificador_multinomial = GridSearchCV(clasificador_multinomial, parameters, cv=5, n_jobs=-1)
gs_clasificador_multinomial = gs_clasificador_multinomial.fit(X_train, y_train)

print("--- %s seconds ---" % (time.time() - start_time))
print("Precision Multinomial: ",gs_clasificador_multinomial.best_score_)

print(gs_clasificador_multinomial.best_params_)

--- 214.50877141952515 seconds ---
Precision Multinomial:  0.8025
{'CLASIFICADOR-MULTINOMIAL__alpha': 0.01, 'TRANSFORMADOR-tfidf__use_idf': False, 'VECTORIZADOR__ngram_range': (1, 2)}


### K-Nearest Neighbors (KNN)


In [28]:
from sklearn import neighbors

knn = neighbors.KNeighborsClassifier(n_neighbors=100)
knn

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=100, p=2,
                     weights='uniform')

In [29]:
clasificador_knn = Pipeline([('VECTORIZADOR', CountVectorizer(stop_words='english', tokenizer=word_tokenize, preprocessor=preprocesador)), 
                             ('TRANSFORMADOR-tfidf', TfidfTransformer()),
                             ('CLASIFICADOR-KNN', knn)]) 


clasificador_knn = clasificador_knn.fit(X_train, y_train)
predicted_knn = clasificador_knn.predict(X_test)

#print ("Clases identificadas : ", predicted_knn)
#print ("Clases reales del doc: ", y_test)

In [30]:
print("Precision de KNN: ",np.mean(predicted_knn == y_test))

#n_neighbors -> 2 = 0.65 , 3 = 0.71 , 4 =  0.66 , 5 = 0.675 , ... , 
#               30 = 0.71 , 50 = 0.7175 , 100 = 0.765 , 250 = 0.765  , 500 = 0.72  , ....

Precision de KNN:  0.775
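For intuition about what the classifier does at prediction time, the sketch below implements a plain k-nearest-neighbors vote with Euclidean distance and uniform weights, the same settings shown in the repr above (`p=2`, `weights='uniform'`). The function name and the 2-D toy points are assumptions for the example; real review vectors have thousands of dimensions.

```python
import math
from collections import Counter

def knn_predict(train_vectors, train_labels, query, k=3):
    """Predict the majority label among the k training points closest to query."""
    distances = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]  # toy feature vectors
labels = [0, 0, 0, 1, 1, 1]
print(knn_predict(points, labels, (0.2, 0.1)), knn_predict(points, labels, (5.5, 5.5)))
```

A larger `k` averages over more neighbors, which may explain why the accuracy in the comments above keeps improving up to around 100 neighbors before dropping again.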


In [31]:
parameters = {'VECTORIZADOR__ngram_range': [(1, 1), (1, 2)], 
              'TRANSFORMADOR-tfidf__use_idf': (True, False), 
              #'CLASIFICADOR-SVC__C': (1e-2, 1e-3)}
              'CLASIFICADOR-KNN__p': (1,0)}
              #'CLASIFICADOR-SVC__probability': (1,0)}

In [32]:

start_time = time.time()

gs_clasificador_knn = GridSearchCV(clasificador_knn, parameters, cv=5, n_jobs=-1)


gs_clasificador_knn = gs_clasificador_knn.fit(X_train, y_train)

print(gs_clasificador_knn.get_params().keys())
#print(gs_knn_clf.get_params().keys())  # lists every tunable parameter

print("Precision knn: ",gs_clasificador_knn.best_score_)  # tuning the classifier's p parameter does not help much...
print("--- %s seconds ---" % (time.time() - start_time))

print(gs_clasificador_knn.best_params_)

dict_keys(['cv', 'error_score', 'estimator__memory', 'estimator__steps', 'estimator__verbose', 'estimator__VECTORIZADOR', 'estimator__TRANSFORMADOR-tfidf', 'estimator__CLASIFICADOR-KNN', 'estimator__VECTORIZADOR__analyzer', 'estimator__VECTORIZADOR__binary', 'estimator__VECTORIZADOR__decode_error', 'estimator__VECTORIZADOR__dtype', 'estimator__VECTORIZADOR__encoding', 'estimator__VECTORIZADOR__input', 'estimator__VECTORIZADOR__lowercase', 'estimator__VECTORIZADOR__max_df', 'estimator__VECTORIZADOR__max_features', 'estimator__VECTORIZADOR__min_df', 'estimator__VECTORIZADOR__ngram_range', 'estimator__VECTORIZADOR__preprocessor', 'estimator__VECTORIZADOR__stop_words', 'estimator__VECTORIZADOR__strip_accents', 'estimator__VECTORIZADOR__token_pattern', 'estimator__VECTORIZADOR__tokenizer', 'estimator__VECTORIZADOR__vocabulary', 'estimator__TRANSFORMADOR-tfidf__norm', 'estimator__TRANSFORMADOR-tfidf__smooth_idf', 'estimator__TRANSFORMADOR-tfidf__sublinear_tf', 'estimator__TRANSFORMADOR-tfidf

### Random Forest Classifier

RandomForest is an ensemble estimator made up of a set of randomly generated decision trees.
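The ensemble idea can be illustrated with a toy forest of one-word "stumps" that vote on the label. This is a simplification under stated assumptions: the stumps, words and helper names are hypothetical, and scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes.

```python
from collections import Counter

def make_stump(word, label_if_present):
    """Hypothetical one-rule tree: predicts by the presence of a single word."""
    return lambda tokens: label_if_present if word in tokens else 1 - label_if_present

forest = [make_stump("great", 1), make_stump("boring", 0), make_stump("worst", 0)]

def forest_predict(forest, tokens):
    """Return the label chosen by most trees (hard majority vote)."""
    votes = Counter(tree(tokens) for tree in forest)
    return votes.most_common(1)[0][0]

print(forest_predict(forest, {"a", "great", "film"}))   # 1
print(forest_predict(forest, {"worst", "boring"}))      # 0
```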

In [33]:
from sklearn.ensemble import RandomForestClassifier

In [34]:
clasificador_rundom_forest = Pipeline([('VECTORIZADOR', CountVectorizer(stop_words='english', tokenizer=word_tokenize, preprocessor=preprocesador)), 
                                       ('TRANSFORMADOR-tfidf', TfidfTransformer()), 
                                       ('CLASIFICADOR-RUN_FOREST', RandomForestClassifier(n_estimators=1250))]) #more estimators generally gave higher accuracy, up to a point


clasificador_run_forest = clasificador_rundom_forest.fit(X_train, y_train)
predicted_run_forest = clasificador_run_forest.predict(X_test)

#print("Clases identificadas : ", predicted_run_forest) 
#print("Clases reales del doc: ", y_test)

In [35]:
print("Precision de Run Forest: ",np.mean(predicted_run_forest == y_test))

#estimators 10 -> .62
#           50 -> .75
#           100 -> .77
#           250 -> .79 
#           500 -> .8
#           800 -> .8025
#           1000 -> .8025
#           1250 -> .8125
#           1500 -> .8025
#           2000 -> .80

Precision de Run Forest:  0.825


In [36]:
parameters = {'VECTORIZADOR__ngram_range': [(1, 1), (1, 2)], 
              'TRANSFORMADOR-tfidf__use_idf': (True, False), 
              #'CLASIFICADOR-RUN_FOREST__criterion': (1e-2, 1e-3)}
              'CLASIFICADOR-RUN_FOREST__ccp_alpha': (1e-2, 1e-3)}
              #'CLASIFICADOR-SVC__probability': (1,0)}

In [37]:

start_time = time.time()

gs_clasificador_run_forest = GridSearchCV(clasificador_run_forest, parameters, cv=5, n_jobs=-1)

#print(gs_clasificador_run_forest.get_params().keys())  #asi veo todos los posibles parametros

gs_clasificador_run_forest = gs_clasificador_run_forest.fit(X_train, y_train)

print("Precision Random Forest: ",gs_clasificador_run_forest.best_score_) 
print("--- %s seconds ---" % (time.time() - start_time))

print(gs_clasificador_run_forest.best_params_)

# Slow to run, but this is the best random forest configuration, with an accuracy of about .833

Precision Random Forest:  0.8331249999999999
--- 1432.325011253357 seconds ---
{'CLASIFICADOR-RUN_FOREST__ccp_alpha': 0.001, 'TRANSFORMADOR-tfidf__use_idf': False, 'VECTORIZADOR__ngram_range': (1, 1)}


### Classifier evaluation and conclusions

Cross validation is used to compare the classifiers built earlier. 

The SVM, Naive Bayes and RandomForest results are evaluated.
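As a reminder of what 5-fold cross validation does with the 2000 reviews, the sketch below generates contiguous train/test index folds. It is a simplified illustration: `kfold_indices` is a hypothetical helper, and for classifiers with an integer `cv`, scikit-learn's `cross_val_score` uses stratified folds that preserve the class proportions.

```python
def kfold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

for train_idx, test_idx in kfold_indices(2000, k=5):
    print(len(train_idx), "train /", len(test_idx), "test")
```

Every review is used for testing exactly once, which is why the fold scores give a steadier estimate than a single 80/20 split.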

In [38]:
from sklearn.model_selection import cross_val_score

# SGD

#scores = cross_val_score(clasificador_sgd, X_train, y_train, cv=5)
#training set -> 0.84

scores = cross_val_score(clasificador_sgd, X, Y, cv=5)
#full dataset -> 0.84

print(scores)

print("Precision SGD: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[0.835  0.845  0.8275 0.865  0.8225]
Precision SGD: 0.84 (+/- 0.03)
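The summary line can be reproduced from the fold scores printed above. The sketch uses the standard library instead of NumPy; `pstdev` is the population standard deviation, which matches NumPy's default `std()` behaviour.

```python
from statistics import mean, pstdev

scores = [0.835, 0.845, 0.8275, 0.865, 0.8225]  # SGD fold scores printed above
summary = "%0.2f (+/- %0.2f)" % (mean(scores), pstdev(scores) * 2)
print(summary)  # 0.84 (+/- 0.03)
```

The `+/-` term is two standard deviations of the fold scores, a rough indication of how much the accuracy varies between folds.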


In [39]:
# SVC

#scores = cross_val_score(clasificador_svc, X_train, y_train, cv=5)
#training set -> 0.83

scores = cross_val_score(clasificador_svc, X, Y, cv=5)
#full dataset -> .83

print(scores)

print("Precision SVC: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[0.845  0.8325 0.8175 0.84   0.82  ]
Precision SVC: 0.83 (+/- 0.02)


In [40]:
# BERNOULLI

#scores = cross_val_score(clasificador_bernoulli, X_train, y_train, cv=5)
#training set -> 0.78

scores = cross_val_score(clasificador_bernoulli, X, Y, cv=5)
#full dataset ->  0.79

print(scores)

print("Precision Bernoulli/Binomial: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[0.8125 0.77   0.7875 0.7925 0.7675]
Precision Bernoulli/Binomial: 0.79 (+/- 0.03)


In [41]:
#MULTINOMIAL

#scores = cross_val_score(clasificador_multinomial, X_train, y_train, cv=5)
#training set -> 0.82

scores = cross_val_score(clasificador_multinomial, X, Y, cv=5)
#full dataset -> 0.82

print(scores)

print("Precision Multinomial: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[0.82   0.835  0.805  0.8325 0.785 ]
Precision Multinomial: 0.82 (+/- 0.04)


In [42]:
#RANDOM FOREST

#scores = cross_val_score(clasificador_rundom_forest, X_train, y_train, cv=5)
#training set -> 0.81

scores = cross_val_score(clasificador_rundom_forest, X, Y, cv=5)
#full dataset -> 0.82

print(scores)

print("Precision Random Forest: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[0.825  0.8225 0.84   0.805  0.83  ]
Precision Random Forest: 0.82 (+/- 0.02)


## Tests

We try the classifiers on new movie reviews.



Positive reviews originally in Spanish, translated into English

https://www.elcineenlasombra.com/criticas-de-cine/


Negative reviews of the worst movies of 2020 

https://variety.com/lists/worst-films-movies-2020/

In [43]:
x_pos1 = "Sean Thorton (John Wayne) arrives in his native village, after many years, with the clear intention of buying the house that belonged to his family. In this negotiation an enemy is won, Victor McLaglen, who turns out to be the brother of the girl Sean has fallen in love with, Mauren O'Hara. The apparently comical scuffles show us a culture anchored in the past with past customs that condition the love between these two people who love each other. Apart from the vicissitudes and barriers that lovers have, there are also two ways of being antagonistic between the two: Thorton is, as the title of the film indicates, a quiet man for reasons that will be glimpsed in the course of history and she, on the contrary, is a veritable whirlwind. However, or at least that is the case in this film, love will manage to overcome most conflicts. When you see a classic from the 1950s, you realize, on the one hand, how much the world has changed and, on the other, that it has hardly changed. The person is the way he is despite the society that surrounds him, the historical moment in which he lives and the place where he is. John Wayne plays the big man from before: big, tough, good and gentleman. Mauren O'Hara, for her part, plays her role as a woman at arms to take very well, making her character so credible that you cannot conceive of the actress in the real world behaving otherwise. He transmits a lot of genius, a very marked personality and a hyperactivity that makes you want to slow down. The brother, Victor McLaglen, acts as an alpha male, as a powerful of the people and tries to mark his territory by brute force: the corpulence, the face and, above all, the rustic gestures of the actor accompany him and are suitable for the script. If we get corny, we could say that it is a macho movie, but I am not going to do it. 
On the contrary, I am going to recommend it by putting honey on their lips, telling them that in it they will find one of the most famous kisses in the history of cinema: when John Wayne pulls Mauren O'Hara's arm with passionate intentions. Perhaps Mr. Thorton was a bit brusque, if we see it with the eyes of the 21st century, but it has become, without a doubt, a mythical scene of the seventh art since even Spielberg honors it in ET. Enjoy it and, please, put the offended in all of us lately to bed."

x_neg1 = "Drug dealers used to have the mantra “Don’t get high on your own supply.” Maybe movie stars should live by the credo “Dolittle — just don’t do it.” The 1998 reboot was merely another middling Eddie Murphy comedy, but this Robert Downey Jr. remake achieves the staggering feat of being much, much worse than the fabled, creaky-boned 1967 Hollywood musical debacle. Is the problem the charmless critters? The ungodly mess of a story? Or the mechanical whimsy of Downey, who barely talks to the animals because he’s so busy talking to himself? All of the above. “Dolittle” is a movie that’s more excruciating than the sum of its frenetic yet lifeless kiddie-blockbuster parts."

In [44]:
#https://www.elcineenlasombra.com/criticas-de-cine/

# source of the reviews in Spanish, translated into English

In [45]:
print("SVC ",clasificador_svc.predict([x_pos1, x_neg1]))

print("SGD ",clasificador_sgd.predict([x_pos1, x_neg1]))

print("RunForest ",clasificador_rundom_forest.predict([x_pos1, x_neg1]))

SVC  [1 0]
SGD  [1 0]
RunForest  [1 0]


The initial results are correct.

We extrapolate to general opinions, not necessarily about movies...

For example, opinions about personal experiences, purchases, sales, rentals, songs, etc...


positive experiences



negative experiences

https://www.lodgify.com/blog/es/opiniones-negativas/


In [46]:
x_neg_propiedades = "This property is nothing like your ad. For starters, the owners live downstairs with their two young children, which was not mentioned in the ad. As we were getting settled in the house, the children climbed the stairs to see what we were doing and to get our attention. One of them was too small to walk the stairs alone, so at one point Grandma came and got them out of there. I like children, but I don't want my hosts' children to bother me when I'm on vacation. Another thing to keep in mind is that they have a rule of not making noise after 22:40, which seems fine to me, but the apartment is offered to 5 adults, and it is almost impossible to be silent with so many people."
# classified correctly



In [47]:
clasificador_svc.predict([x_neg_propiedades])

array([0])

## Conclusions

Comparing each original (raw) review with its preprocessed version, we conclude that these techniques improve the results.


Although preprocessing shrinks the text by up to 50%, the gain in the final results is far smaller: it adds only a couple of points of accuracy.


To simplify building classifiers, the sklearn utilities make it possible to create a pipeline and parametrize it. 
Here each pipeline combines a vectorizer with a custom preprocessor, a transformer and, of course, the classifier. 
Organizing the data flow this way adds a couple of points of accuracy.


Together, these 2 processes (the custom preprocessor and the parametrized pipeline) improve the final accuracy.  






Regarding the classifiers, using the cross-validation scores reported above:

The Naive Bayes models give acceptable results. 
BernoulliNB, which works on binary word-presence features, reaches an accuracy of about .79; 
MultinomialNB, although designed for categorization over counts, works well on this binary problem, with an accuracy of .82

The RandomForest ensemble, a set of decision trees, gives good results, with an accuracy of .82

The best-performing classifiers are the support vector machines (SVM). 
SGDClassifier gives the best results, with an accuracy of .84; 
SVC also performs well, with an accuracy of .83





Source

https://arxiv.org/abs/cs/0409058



### Future work


Test the classifiers more generally, with different kinds of opinions.

Strengthen the algorithm's training. Starting from the domains (URLs) of opinion/review sites, build a pipeline fed with text scraped with BeautifulSoup(). 
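As a first step in that direction, the sketch below extracts the visible text from an HTML page. To stay dependency-free it swaps BeautifulSoup for the standard library's `html.parser`, so it is only an approximation of the proposed approach; the `TextExtractor` class, the `html_to_text` helper and the sample markup are assumptions for the example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    """Return the page's visible text as a single space-separated string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

sample = ("<html><body><h1>Review</h1><p>A <b>great</b> film.</p>"
          "<script>var x=1;</script></body></html>")
print(html_to_text(sample))  # Review A great film.
```

The resulting string could then be passed to one of the fitted pipelines, for example `clasificador_sgd.predict([text])`.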
