# Homework 3 - Sentiment Analysis

This notebook trains and evaluates sentiment analysis models on the `Multi-domain Sentiment` dataset using Naive Bayes and logistic regression classifiers. It uses the `MultinomialNB` and `SGDClassifier` models from the *scikit-learn* library.

## **Libraries**
This notebook relies on the scikit-learn library. It also uses the SenticNet5 lexicons to build features.

In [None]:
import pathlib
import pandas as pd


from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
#
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler


from  senticnet5 import senticnet

## **Global variables**

Sentiment analysis is performed on the Multi-domain Sentiment Dataset, which consists of preprocessed reviews of Amazon.com products. The reviews are already tokenized into a bag-of-words representation, which simplifies processing them and training the classification models. The dataset files are loaded from the `processed_acl` folder.

In [None]:
# global constants
CWD = pathlib.Path('.').absolute().resolve()

MDS_DATASET = CWD / 'processed_acl'

OPINION_POSITIVE_WORDS = CWD / 'opinion-lexicons' / 'positive-words.txt'
OPINION_NEGATIVE_WORDS = CWD / 'opinion-lexicons' / 'negative-words.txt'


assert MDS_DATASET.exists(), f'The folder {MDS_DATASET} does not exist.'

assert OPINION_POSITIVE_WORDS.exists(), f'File {OPINION_POSITIVE_WORDS} not found.'
assert OPINION_NEGATIVE_WORDS.exists(), f'File {OPINION_NEGATIVE_WORDS} not found.'

## **Loading and preprocessing**

The data is loaded by processing each file and then building a DataFrame with the columns label, features, domain and origin.

In [None]:
def parse_file_line(line:str) -> tuple[str, dict[str,int]]:
  """
  Parse one review and return its label ('positive' or 'negative') and its features stored in a dictionary.

  Parameters
  ----------
  line: str
    Review in the format feature:count feature:count ... #label#:positive

  Returns
  -------
  tuple[str, dict[str,int]]
    Tuple with the review label and a dictionary of its features.

    E.g.
      ('positive', {'feature': count, 'feature': count})

  """

  features, label = line.replace('\n','').split(' #label#:')

  feature_dict = {}
  for f in features.split(' '):
    feature, count = f.split(':')

    feature_dict[feature.strip()] = int(count)

  return label, feature_dict
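
As a quick check of the line format described above, here is a minimal standalone sketch of the same parsing step. The function name `parse_review_line` and the sample line are made up for illustration; the use of `rsplit` is an assumption that guards against feature names containing a colon.

```python
def parse_review_line(line: str) -> tuple[str, dict[str, int]]:
    """Split one 'feature:count ... #label#:<label>' line into (label, counts)."""
    features, label = line.strip().split(' #label#:')
    counts = {}
    for token in features.split(' '):
        # rsplit keeps any ':' that is part of the feature name itself
        name, count = token.rsplit(':', 1)
        counts[name] = int(count)
    return label.strip(), counts


# Hypothetical review line, not taken from the dataset
label, counts = parse_review_line("great_book:1 the:3 loved:2 #label#:positive\n")
print(label, counts)
```

Bigrams such as `great_book` arrive as single features joined by an underscore, which matters later when the lexicon features split them back into words.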

In [None]:
def load_mds_dataset(extra_processing=None) -> pd.DataFrame:
  """
  Load the Multi-domain dataset into a pd.DataFrame. Optionally applies an extra step to the features by calling `extra_processing`.

  Parameters
  ----------
  extra_processing: callable | None.
    Additional feature-processing step.


  Returns
  -------
  pd.DataFrame
    DataFrame holding the dataset.
  """

  mds_dataset = []

  # load the data (Path.walk() requires Python 3.12+)
  for dirpath, _, files in MDS_DATASET.walk():

    for file in files:
      filepath = dirpath / file
      with open(filepath,'r') as content:

        for line in content:
          label, feature_dict = parse_file_line(line)

          if extra_processing is not None:
            feature_dict = extra_processing(feature_dict)

          mds_dataset.append(
            {
              'label':label,
              'features': feature_dict,
              'domain': filepath.parent.name,
              'origin': filepath.stem
            }
          )

  mds_dataset = pd.DataFrame( mds_dataset )

  return mds_dataset
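
The `domain` and `origin` columns come straight from the file path. The sketch below shows that derivation on a hypothetical path; it assumes a `<domain>/<origin>.review` layout inside `processed_acl`, which matches the `books` / `negative` values shown in the output further down.

```python
from pathlib import PurePosixPath

# Hypothetical file location, assuming the <domain>/<origin>.review layout
filepath = PurePosixPath('processed_acl') / 'books' / 'negative.review'

record = {
    'domain': filepath.parent.name,  # name of the enclosing folder
    'origin': filepath.stem,         # file name without its extension
}
print(record)
```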

In [None]:
mds_dataset = load_mds_dataset()
mds_dataset.head()

| | label | features | domain | origin |
|---|---|---|---|---|
| 0 | negative | {'avid': 1, 'your': 1, 'horrible_book': 1, 'wa... | books | negative |
| 1 | negative | {'to_use': 1, 'shallow': 1, 'found': 1, 'he_ca... | books | negative |
| 2 | negative | {'avid': 1, 'your': 1, 'horrible_book': 1, 'wa... | books | negative |
| 3 | negative | {'book_seriously': 1, 'we': 1, 'days_couldn't'... | books | negative |
| 4 | negative | {'mass': 1, 'only': 1, 'he': 2, 'help': 1, '"j... | books | negative |


## **Sentiment Analysis**

In this stage, Naive Bayes and logistic regression models are trained to determine whether each review is positive or negative, separately for each domain in the dataset.

Training and reporting are wrapped in a single function that holds the pipeline used throughout the notebook. The function lets the caller choose the model (Naive Bayes or logistic regression), the dataset, the domain and the feature-vectorization steps.

In [None]:
def train_and_eval(model, dataset, domain, pipeline_list) -> None:
  """ Train the model on the given dataset. Filters the dataset by domain and applies pipeline_list
    as the vectorization step.

    Parameters:

    model:
      Model to train.

    dataset:
      Dataset used for training and evaluation. It must contain both the training and
      test data for the requested domain.

    domain:
      Domain (e.g. 'books') whose reviews are used.

    pipeline_list:
      List of steps used to build the feature vectorizer.
      This parameter is passed to sklearn.pipeline.Pipeline
      E.g. DictVectorizer for TF counts

    """



  # ----- train val test split
  training_filter = (dataset['domain'] == domain) & (dataset['origin'].isin(['positive','negative']))
  test_filter = (dataset['domain'] == domain) & (dataset['origin'] == 'unlabeled')

  # features
  training_features = dataset[ training_filter ]['features'].values
  test_features = dataset[ test_filter ]['features'].values

  # labels
  training_labels = dataset[ training_filter ]['label'].values
  test_labels = dataset[ test_filter ]['label'].values


  # Feature preprocessing (conversion to counts or tf-idf)
  vectorizer = Pipeline(pipeline_list).fit(training_features)

  training_features = vectorizer.transform(training_features)
  test_features = vectorizer.transform(test_features)



  model = model.fit(training_features, training_labels)
  predictions = model.predict(test_features)

  print(f'----------------------------------------------------------')
  print(f'--------- Resultados de clasificación para {domain} ---------')
  print(f'----------------------------------------------------------')
  print('Precisión: {:.5f}'.format(precision_score(test_labels, predictions,pos_label='positive')))
  print('Recall: {:.5f}'.format(recall_score(test_labels, predictions,pos_label='positive')))
  print('F1 score: {:.5f}'.format(f1_score(test_labels, predictions,pos_label='positive')))
  print('Accuracy: {:.5f}'.format(accuracy_score(test_labels, predictions)))
  print(f'----------------------------------------------------------\n')
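
To make the four reported numbers concrete, here is a small pure-Python sketch of how precision, recall, F1 and accuracy follow from the confusion counts when `'positive'` is the positive label. The helper `binary_metrics` and the toy labels are illustrative only; the notebook itself relies on the scikit-learn metric functions.

```python
def binary_metrics(y_true, y_pred, pos_label='positive'):
    """Return (precision, recall, f1, accuracy) for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / len(pairs)
    return precision, recall, f1, accuracy


# Toy labels: 2 true positives, 1 false positive, 1 false negative
y_true = ['positive', 'positive', 'negative', 'negative', 'positive']
y_pred = ['positive', 'negative', 'negative', 'positive', 'positive']
precision, recall, f1, accuracy = binary_metrics(y_true, y_pred)
print(precision, recall, f1, accuracy)
```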



## **Naive Bayes**
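
For intuition about what the classifier in this section does, the following is a rough pure-Python sketch of multinomial Naive Bayes with additive (Laplace) smoothing, the same kind of smoothing `MultinomialNB` applies with its default `alpha=1.0`. The function names and the toy reviews are hypothetical, and the sketch skips the sparse-matrix handling of the real implementation.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """docs: list of {feature: count}. Returns (log_prior, log_likelihood)."""
    vocab = sorted({f for d in docs for f in d})
    class_docs = Counter(labels)
    feature_totals = defaultdict(Counter)
    for d, y in zip(docs, labels):
        feature_totals[y].update(d)  # adds the counts of d to class y

    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        total = sum(feature_totals[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            f: math.log((feature_totals[c][f] + alpha) / total) for f in vocab
        }
    return log_prior, log_likelihood


def predict_multinomial_nb(model, doc):
    """Pick the class with the highest log posterior; unseen features are ignored."""
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c]
        + sum(n * log_likelihood[c][f] for f, n in doc.items() if f in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy training set, not taken from the dataset
docs = [{'great': 2, 'book': 1}, {'boring': 1, 'book': 1},
        {'great': 1, 'loved': 1}, {'boring': 2, 'waste': 1}]
labels = ['positive', 'negative', 'positive', 'negative']
model = train_multinomial_nb(docs, labels)
pred_a = predict_multinomial_nb(model, {'great': 1})
pred_b = predict_multinomial_nb(model, {'boring': 1, 'waste': 1})
print(pred_a, pred_b)
```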



### **TF Representation**
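
The TF representation maps each feature dictionary to a row of raw counts. The sketch below imitates that step with plain lists, assuming the default `DictVectorizer` behavior of sorting feature names and dropping features not seen during fitting. The helper names are hypothetical, and the real vectorizer returns a sparse matrix rather than nested lists.

```python
def fit_vocabulary(feature_dicts):
    """Assign a column index to every feature name, in sorted order."""
    names = sorted({name for d in feature_dicts for name in d})
    return {name: idx for idx, name in enumerate(names)}


def to_count_rows(feature_dicts, vocabulary):
    """Turn each {feature: count} dict into a dense row of counts."""
    rows = []
    for d in feature_dicts:
        row = [0] * len(vocabulary)
        for name, count in d.items():
            if name in vocabulary:  # features unseen at fit time are dropped
                row[vocabulary[name]] = count
        rows.append(row)
    return rows


train_dicts = [{'great': 2, 'book': 1}, {'boring': 1, 'book': 1}]
vocabulary = fit_vocabulary(train_dicts)
train_rows = to_count_rows(train_dicts, vocabulary)
test_rows = to_count_rows([{'great': 1, 'unseen': 4}], vocabulary)
print(vocabulary, train_rows, test_rows)
```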


In [None]:
nb_model = MultinomialNB()

pipe_list = [ ('dictVectorizer', DictVectorizer())
                    ]

for domain in mds_dataset['domain'].unique().tolist():
  train_and_eval(nb_model, mds_dataset, domain, pipe_list)

----------------------------------------------------------
--------- Resultados de clasificación para books ---------
----------------------------------------------------------
Precisión: 0.88786
Recall: 0.76237
F1 score: 0.82034
Accuracy: 0.83068
----------------------------------------------------------

----------------------------------------------------------
--------- Resultados de clasificación para dvd ---------
----------------------------------------------------------
Precisión: 0.83163
Recall: 0.81184
F1 score: 0.82162
Accuracy: 0.82236
----------------------------------------------------------

----------------------------------------------------------
--------- Resultados de clasificación para electronics ---------
----------------------------------------------------------
Precisión: 0.85749
Recall: 0.85299
F1 score: 0.85524
Accuracy: 0.85478
----------------------------------------------------------

----------------------------------------------------------
--------- Res

### **TF-IDF Representation**
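
As a sketch of the reweighting applied in this section, the code below computes TF-IDF on dense count rows using the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 row normalization, which is what `TfidfTransformer` does with its default settings. The function name and the toy counts are illustrative.

```python
import math


def tfidf_rows(count_rows):
    """Apply smoothed idf weighting and L2-normalize every row."""
    n_docs = len(count_rows)
    n_features = len(count_rows[0])
    # document frequency: in how many rows each column is non-zero
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_features)]
    idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]

    result = []
    for row in count_rows:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result


# Columns: book, boring, great (toy counts)
counts = [[1, 0, 2], [1, 1, 0]]
weights = tfidf_rows(counts)
print(weights)
```

In the second row, `book` and `boring` both have a count of 1, yet `boring` ends up with the larger weight because it appears in fewer documents.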

In [None]:
tfidf_pipe_list = [ ('dictVectorizer', DictVectorizer())
             ,('TfidfTransform', TfidfTransformer())
                    ]

for domain in mds_dataset['domain'].unique().tolist():
  train_and_eval(nb_model, mds_dataset, domain,tfidf_pipe_list)


----------------------------------------------------------
--------- Resultados de clasificación para books ---------
----------------------------------------------------------
Precisión: 0.91269
Recall: 0.74337
F1 score: 0.81938
Accuracy: 0.83382
----------------------------------------------------------

----------------------------------------------------------
--------- Resultados de clasificación para dvd ---------
----------------------------------------------------------
Precisión: 0.86733
Recall: 0.82125
F1 score: 0.84366
Accuracy: 0.84663
----------------------------------------------------------

----------------------------------------------------------
--------- Resultados de clasificación para electronics ---------
----------------------------------------------------------
Precisión: 0.88832
Recall: 0.84914
F1 score: 0.86829
Accuracy: 0.87045
----------------------------------------------------------

----------------------------------------------------------
--------- Res

### **Lexicon-Based Representation**

In [None]:
# Convert the SenticNet5 values from string to float
new_senticnet = {}

for k in senticnet:
  new_list =[]

  for elem in senticnet[k]:
    try:
      elem = float(elem)
    except ValueError:
      # non-numeric entries (labels, related concepts) stay as strings
      pass

    new_list.append(elem)

  new_senticnet[k] = new_list
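
The conversion above can be tried on a single made-up entry. The toy entry below is hypothetical; its layout (four numeric dimensions, two mood tags, a polarity label at index 6 and a polarity value at index 7) is inferred from the indices this notebook reads later, not from the SenticNet5 documentation.

```python
def to_float_if_numeric(value):
    """Return float(value) when possible, otherwise the original string."""
    try:
        return float(value)
    except ValueError:
        return value


# Hypothetical entry, all fields stored as strings
toy_senticnet = {
    'toy_concept': ['0.5', '-0.2', '0', '0.9', '#joy', '#interest', 'positive', '0.8'],
}
converted = {
    concept: [to_float_if_numeric(v) for v in values]
    for concept, values in toy_senticnet.items()
}
print(converted)
```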

In [None]:
# Load the opinion lexicons

# dictionary mapping each lexicon word to whether it is a positive word
# e.g. opinion_lexicons['a+'] = True,  opinion_lexicons['2-faced'] = False
opinion_lexicons = {}

# load the positive words
with open(OPINION_POSITIVE_WORDS,'r') as pos:

  for line in pos:
    if line.startswith(';') or line == '\n':
      continue
    else:
      opinion_lexicons[line.strip()] = True
# load the negative words
with open(OPINION_NEGATIVE_WORDS,'r') as neg:
  for line in neg:
    if line.startswith(';') or line == '\n':
      continue
    else:
      opinion_lexicons[line.strip()] = False



opinion_lexicons['a+'], opinion_lexicons['2-faced']

(True, False)
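
The loading loop skips header lines starting with `;` and blank lines. A compact sketch of the same filtering on an in-memory file is shown below; `read_lexicon` and the sample content are made up for illustration.

```python
import io


def read_lexicon(handle):
    """Collect the words of a lexicon file, skipping ';' comments and blank lines."""
    words = set()
    for line in handle:
        word = line.strip()
        if not word or word.startswith(';'):
            continue
        words.add(word)
    return words


# Hypothetical file content with a comment header and one blank line
sample = io.StringIO("; header comment\n;\n\na+\nabound\n")
words = read_lexicon(sample)
print(sorted(words))
```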

#### Opinion Lexicons

In [None]:

def build_opinion_features(feature_dict):

  positive_word_count = negative_word_count = 0

  for feature in feature_dict:
    if '_' in feature:
      features = feature.split('_')
    else:
      features = [feature]

    for f in features:
      if f in opinion_lexicons:
        if opinion_lexicons[f]:
          positive_word_count += feature_dict[feature]
        else:
          negative_word_count += feature_dict[feature]

  positive_negative_ratio = positive_word_count / negative_word_count if negative_word_count > 0 else 1
  negative_positive_ratio = negative_word_count / positive_word_count if positive_word_count > 0 else 1


  return {
    'positive_words' : positive_word_count,
    'negative_words' : negative_word_count,
    'total_count':positive_word_count + negative_word_count,

    'positive_negative_ratio': positive_negative_ratio,
    'negative_positive_ratio': negative_positive_ratio
  }
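
A small worked example may help in reading these counts. The sketch below repeats the counting logic on a toy lexicon and a toy review (both hypothetical) and shows one consequence of splitting bigrams: a word that occurs both as a unigram and inside a bigram feature is counted once for each feature.

```python
# Hypothetical lexicon: True marks a positive word, False a negative one
toy_lexicon = {'great': True, 'love': True, 'boring': False}


def opinion_counts(feature_dict, lexicon):
    """Count positive and negative lexicon hits, splitting bigrams on '_'."""
    positive = negative = 0
    for feature, count in feature_dict.items():
        for word in feature.split('_'):
            if word in lexicon:
                if lexicon[word]:
                    positive += count
                else:
                    negative += count
    return positive, negative


# 'great' contributes through the unigram and again through 'great_book'
review = {'great': 1, 'great_book': 1, 'boring': 1, 'the': 5}
positive, negative = opinion_counts(review, toy_lexicon)
print(positive, negative)
```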



In [None]:
opinion_mds_dataset = load_mds_dataset(build_opinion_features)

opinion_pipe_list = [ ('dictVectorizer', DictVectorizer()),
             ('standardScaler',StandardScaler(with_mean=False))
                    ]

nb_model = MultinomialNB()
for domain in mds_dataset['domain'].unique().tolist():
  train_and_eval(nb_model, opinion_mds_dataset, domain, opinion_pipe_list)

----------------------------------------------------------
--------- Resultados de clasificación para books ---------
----------------------------------------------------------
Precisión: 0.68753
Recall: 0.74735
F1 score: 0.71619
Accuracy: 0.69966
----------------------------------------------------------

----------------------------------------------------------
--------- Resultados de clasificación para dvd ---------
----------------------------------------------------------
Precisión: 0.72094
Recall: 0.76204
F1 score: 0.74092
Accuracy: 0.73146
----------------------------------------------------------

----------------------------------------------------------
--------- Resultados de clasificación para electronics ---------
----------------------------------------------------------
Precisión: 0.72266
Recall: 0.80259
F1 score: 0.76053
Accuracy: 0.74582
----------------------------------------------------------

----------------------------------------------------------
--------- Res

In [None]:
opinion_mds_dataset = load_mds_dataset(build_opinion_features)

opinion_tfidf_pipe_list = [ ('dictVectorizer', DictVectorizer()),
             ('standardScaler',StandardScaler(with_mean=False)),
             ('TfidfTransform', TfidfTransformer())
                    ]

nb_model = MultinomialNB()
for domain in mds_dataset['domain'].unique().tolist():
  train_and_eval(nb_model, opinion_mds_dataset, domain, opinion_tfidf_pipe_list)

----------------------------------------------------------
--------- Resultados de clasificación para books ---------
----------------------------------------------------------
Precisión: 0.69650
Recall: 0.72880
F1 score: 0.71228
Accuracy: 0.70146
----------------------------------------------------------

----------------------------------------------------------
--------- Resultados de clasificación para dvd ---------
----------------------------------------------------------
Precisión: 0.73085
Recall: 0.74986
F1 score: 0.74023
Accuracy: 0.73480
----------------------------------------------------------

----------------------------------------------------------
--------- Resultados de clasificación para electronics ---------
----------------------------------------------------------
Precisión: 0.72611
Recall: 0.77669
F1 score: 0.75055
Accuracy: 0.74036
----------------------------------------------------------

----------------------------------------------------------
--------- Res

#### SenticNet5 Lexicons

In [None]:
def build_lexicon_features(feature_dict):

  # features
  positive_pleasantness = positive_attention = positive_sensitivity = positive_aptitude =  0
  negative_pleasantness = negative_attention = negative_sensitivity = negative_aptitude =  0

  positive_polarity = negative_polarity = 0

  positive_polarity_features = negative_polarity_features = 0

  for k in feature_dict:

    if k in new_senticnet:

      feature_values = new_senticnet[k]
      feature_count = feature_dict[k]

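      # NOTE: only concepts labelled 'positive' pass this guard, so negative_polarity and
      # negative_polarity_features below always stay at 0.
      if feature_values[6] == 'positive':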

        positive_pleasantness += feature_values[0] * feature_count if feature_values[0]>0 else 0
        positive_attention += feature_values[1] * feature_count if feature_values[1]>0 else 0
        positive_sensitivity += feature_values[2] * feature_count if feature_values[2]>0 else 0
        positive_aptitude +=  feature_values[3] * feature_count if feature_values[3]>0 else 0

        negative_pleasantness -= feature_values[0] * feature_count if feature_values[0]<0 else 0
        negative_attention -= feature_values[1] * feature_count if feature_values[1]<0 else 0
        negative_sensitivity -= feature_values[2] * feature_count if feature_values[2]<0 else 0
        negative_aptitude -=  feature_values[3] * feature_count if feature_values[3]<0 else 0

        positive_polarity += feature_values[7] * feature_count if feature_values[7]>0 else 0
        positive_polarity_features += feature_count if feature_values[6] == 'positive' else 0

        negative_polarity -= feature_values[7] * feature_count if feature_values[7]<0 else 0
        negative_polarity_features += feature_count if feature_values[6]=='negative' else 0


  return {
    'positive_pleasantness' : positive_pleasantness,
    'negative_pleasantness' : negative_pleasantness,

    'positive_attention': positive_attention,
    'negative_attention': negative_attention,

    'positive_sensitivity' : positive_sensitivity,
    'negative_sensitivity' : negative_sensitivity,

    'positive_aptitude': positive_aptitude,
    'negative_aptitude': negative_aptitude,

    'positive_polarity' : positive_polarity,
    'negative_polarity' : negative_polarity,

    'positive_polarity_features': positive_polarity_features,
    'negative_polarity_features': negative_polarity_features
  }
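
In `build_lexicon_features`, every accumulator sits inside the check for the `'positive'` label. The sketch below is a hypothetical variant, not the code used for the results in this notebook, that handles both labels for the polarity features only. The toy lexicon entries are invented and assume the polarity label at index 6 and the polarity value at index 7.

```python
def polarity_features(feature_dict, lexicon):
    """Accumulate polarity magnitude and counts separately per polarity label."""
    pos_polarity = neg_polarity = 0.0
    pos_count = neg_count = 0
    for name, count in feature_dict.items():
        if name not in lexicon:
            continue
        label, polarity = lexicon[name][6], lexicon[name][7]
        if label == 'positive':
            pos_polarity += polarity * count
            pos_count += count
        elif label == 'negative':
            neg_polarity += -polarity * count  # store the magnitude as a positive number
            neg_count += count
    return {
        'positive_polarity': pos_polarity,
        'negative_polarity': neg_polarity,
        'positive_polarity_features': pos_count,
        'negative_polarity_features': neg_count,
    }


# Hypothetical lexicon entries (values already converted to float)
toy_lexicon = {
    'good': [0.5, 0.1, 0.0, 0.4, '#joy', '#admiration', 'positive', 0.6],
    'awful': [-0.7, -0.1, 0.2, -0.5, '#sadness', '#disgust', 'negative', -0.8],
}
features = polarity_features({'good': 2, 'awful': 1, 'the': 3}, toy_lexicon)
print(features)
```

Keeping all values non-negative matters here because `MultinomialNB` expects non-negative feature values.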

In [None]:
lexi_mds_dataset = load_mds_dataset(build_lexicon_features)

for domain in mds_dataset['domain'].unique().tolist():
  train_and_eval(nb_model, lexi_mds_dataset, domain, pipe_list)

----------------------------------------------------------
--------- Resultados de clasificación para books ---------
----------------------------------------------------------
Precisión: 0.56462
Recall: 0.60203
F1 score: 0.58273
Accuracy: 0.56282
----------------------------------------------------------

----------------------------------------------------------
--------- Resultados de clasificación para dvd ---------
----------------------------------------------------------
Precisión: 0.55623
Recall: 0.60764
F1 score: 0.58080
Accuracy: 0.55800
----------------------------------------------------------

----------------------------------------------------------
--------- Resultados de clasificación para electronics ---------
----------------------------------------------------------
Precisión: 0.58476
Recall: 0.63143
F1 score: 0.60720
Accuracy: 0.58916
----------------------------------------------------------

----------------------------------------------------------
--------- Res

In [None]:
lexi_mds_dataset = load_mds_dataset(build_lexicon_features)

for domain in mds_dataset['domain'].unique().tolist():
  train_and_eval(nb_model, lexi_mds_dataset, domain, tfidf_pipe_list)

----------------------------------------------------------
--------- Resultados de clasificación para books ---------
----------------------------------------------------------
Precisión: 0.56038
Recall: 0.62102
F1 score: 0.58915
Accuracy: 0.56081
----------------------------------------------------------

----------------------------------------------------------
--------- Resultados de clasificación para dvd ---------
----------------------------------------------------------
Precisión: 0.54483
Recall: 0.61538
F1 score: 0.57796
Accuracy: 0.54713
----------------------------------------------------------

----------------------------------------------------------
--------- Resultados de clasificación para electronics ---------
----------------------------------------------------------
Precisión: 0.59080
Recall: 0.62513
F1 score: 0.60748
Accuracy: 0.59373
----------------------------------------------------------

----------------------------------------------------------
--------- Res

## **Logistic Regression**
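
`SGDClassifier(loss='log_loss')` fits a logistic regression model with stochastic gradient descent. The following is a simplified sketch of that idea with a constant learning rate and an L2 penalty. The function names, the toy data and the fixed `eta` are assumptions for illustration; scikit-learn's default `'optimal'` schedule and its shuffling are not reproduced.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sgd_logistic(rows, targets, eta=0.1, alpha=1e-4, epochs=200):
    """Minimize the log loss one sample at a time; targets are 0 or 1."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, targets):
            p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            error = p - y  # gradient of the log loss with respect to the score
            weights = [w - eta * (error * v + alpha * w) for w, v in zip(weights, x)]
            bias -= eta * error
    return weights, bias


# Toy counts for two features: column 0 signals positive, column 1 negative
rows = [[2, 0], [1, 0], [0, 2], [0, 1]]
targets = [1, 1, 0, 0]
weights, bias = sgd_logistic(rows, targets)
predictions = [
    int(sigmoid(bias + sum(w * v for w, v in zip(weights, x))) > 0.5) for x in rows
]
print(weights, bias, predictions)
```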

### **TF Representation**

In [None]:
lr_model = SGDClassifier(loss='log_loss')

for domain in mds_dataset['domain'].unique().tolist():
  train_and_eval(lr_model, mds_dataset, domain, pipe_list)

----------------------------------------------------------
--------- Resultados de clasificación para books ---------
----------------------------------------------------------
Precisión: 0.81153
Recall: 0.83304
F1 score: 0.82214
Accuracy: 0.81725
----------------------------------------------------------

----------------------------------------------------------
--------- Resultados de clasificación para dvd ---------
----------------------------------------------------------
Precisión: 0.78990
Recall: 0.81350
F1 score: 0.80153
Accuracy: 0.79699
----------------------------------------------------------

----------------------------------------------------------
--------- Resultados de clasificación para electronics ---------
----------------------------------------------------------
Precisión: 0.84114
Recall: 0.84879
F1 score: 0.84495
Accuracy: 0.84334
----------------------------------------------------------

----------------------------------------------------------
--------- Res

### **TF-IDF Representation**

In [None]:

for domain in mds_dataset['domain'].unique().tolist():
  train_and_eval(lr_model, mds_dataset, domain, tfidf_pipe_list )

----------------------------------------------------------
--------- Resultados de clasificación para books ---------
----------------------------------------------------------
Precisión: 0.85784
Recall: 0.85292
F1 score: 0.85537
Accuracy: 0.85375
----------------------------------------------------------

----------------------------------------------------------
--------- Resultados de clasificación para dvd ---------
----------------------------------------------------------
Precisión: 0.83805
Recall: 0.86774
F1 score: 0.85264
Accuracy: 0.84886
----------------------------------------------------------

----------------------------------------------------------
--------- Resultados de clasificación para electronics ---------
----------------------------------------------------------
Precisión: 0.87694
Recall: 0.87049
F1 score: 0.87370
Accuracy: 0.87344
----------------------------------------------------------

----------------------------------------------------------
--------- Res

### **Hyperparameter Optimization**

In [None]:
param_grid = {
    'penalty': ['l2', 'l1','elasticnet'],
    'alpha': [5e-4, 1e-4, 1e-3, 1e-2],
    'max_iter': [1000, 2000],
    'learning_rate': ['optimal', 'adaptive','invscaling'],
    'eta0': [0.001, 0.01, 0.1, 0.5]
}

grid = GridSearchCV(
    estimator=SGDClassifier(loss='log_loss'),
    param_grid=param_grid,
    scoring='accuracy',
    cv=10,
    n_jobs=-1,
    verbose=2
)
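
The size of this search can be checked by enumerating the grid by hand. The sketch below counts the combinations and the total number of fits for 10-fold cross-validation. It also counts a hypothetical split grid that omits `eta0` for the `'optimal'` schedule, on the assumption (taken from the scikit-learn documentation) that `eta0` is not used by that schedule; `GridSearchCV` accepts such a list of grids.

```python
from itertools import product

param_grid = {
    'penalty': ['l2', 'l1', 'elasticnet'],
    'alpha': [5e-4, 1e-4, 1e-3, 1e-2],
    'max_iter': [1000, 2000],
    'learning_rate': ['optimal', 'adaptive', 'invscaling'],
    'eta0': [0.001, 0.01, 0.1, 0.5],
}


def count_candidates(grid):
    """Number of parameter combinations in one grid."""
    return len(list(product(*grid.values())))


n_candidates = count_candidates(param_grid)
n_fits = n_candidates * 10  # one fit per candidate and fold

# Hypothetical alternative: vary eta0 only for the schedules that read it
shared = {k: param_grid[k] for k in ('penalty', 'alpha', 'max_iter')}
split_grid = [
    {**shared, 'learning_rate': ['optimal']},
    {**shared, 'learning_rate': ['adaptive', 'invscaling'], 'eta0': param_grid['eta0']},
]
n_split = sum(count_candidates(g) for g in split_grid)
print(n_candidates, n_fits, n_split)
```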

In [None]:
def hyperparamter_optimization(domain,mds_dataset):
  training_filter = (mds_dataset['domain'] == domain) & (mds_dataset['origin'].isin(['positive','negative']))
  test_filter = (mds_dataset['domain'] == domain) & (mds_dataset['origin'] == 'unlabeled')

  # features
  training_features = mds_dataset[ training_filter ]['features'].values
  test_features = mds_dataset[ test_filter ]['features'].values

  # labels
  training_labels = mds_dataset[ training_filter ]['label'].values
  test_labels = mds_dataset[ test_filter ]['label'].values


  vectorizer = Pipeline(tfidf_pipe_list).fit(training_features)

  training_features = vectorizer.transform(training_features)
  test_features = vectorizer.transform(test_features)

  grid.fit(training_features, training_labels)

  print(f"Resultados de optimización de hiperparámetros para la categoria {domain}")
  print("Mejor score CV:", grid.best_score_)
  print("Parámetros del mejor modelo: ",grid.best_params_)
  print(f'Features mas importantes para la categoria {domain}')
  print(sorted(zip(grid.best_estimator_.coef_.tolist()[0],vectorizer.get_feature_names_out()),key=lambda t:t[0],reverse=True)[:10])

  model = grid.best_estimator_
  predictions = model.predict(test_features)

  print(f'----------------------------------------------------------')
  print(f'--------- Resultados de clasificación para {domain} con datos test ---------')
  print(f'----------------------------------------------------------')
  print('Precisión: {:.5f}'.format(precision_score(test_labels, predictions,pos_label='positive')))
  print('Recall: {:.5f}'.format(recall_score(test_labels, predictions,pos_label='positive')))
  print('F1 score: {:.5f}'.format(f1_score(test_labels, predictions,pos_label='positive')))
  print('Accuracy: {:.5f}'.format(accuracy_score(test_labels, predictions)))
  print(f'----------------------------------------------------------\n')



  return grid.best_estimator_
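
The feature ranking printed by this function sorts the model coefficients in descending order and keeps the first ten. The sketch below does the same on made-up coefficients and also returns the most negative ones, which the function above does not print. `top_features` and the toy values are hypothetical.

```python
def top_features(coefficients, names, k=2):
    """Return the k largest and the k smallest (coefficient, name) pairs."""
    ranked = sorted(zip(coefficients, names), key=lambda t: t[0], reverse=True)
    most_positive = ranked[:k]
    most_negative = ranked[-k:][::-1]  # most negative first
    return most_positive, most_negative


# Toy coefficients, not taken from a fitted model
coefs = [3.9, -2.5, 0.1, 2.6, -4.0]
names = ['great', 'boring', 'the', 'wonderful', 'waste']
most_positive, most_negative = top_features(coefs, names)
print(most_positive, most_negative)
```

With string labels `'negative'` and `'positive'`, a binary `SGDClassifier` orders its classes alphabetically, so large positive coefficients point towards the `'positive'` class.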

In [None]:
hyperparamter_optimization('books', mds_dataset)

Fitting 10 folds for each of 288 candidates, totalling 2880 fits
Resultados de optimización de hiperparámetros para la categoria books
Mejor score CV: 0.834
Parámetros del mejor modelo:  {'alpha': 0.0001, 'eta0': 0.1, 'learning_rate': 'adaptive', 'max_iter': 1000, 'penalty': 'l2'}
Features mas importantes para la categoria books
[(3.9763309827317204, 'great'), (3.897944129246496, 'excellent'), (2.626891659192025, 'wonderful'), (2.5902157508028925, 'easy'), (2.5513893413992497, 'you'), (2.520472229927127, 'my'), (2.3770133502275685, 'the_best'), (2.3760387180547307, 'loved'), (2.3304697092707403, 'love'), (2.3173142294371676, 'best')]
----------------------------------------------------------
--------- Resultados de clasificación para books con datos test ---------
----------------------------------------------------------
Precisión: 0.86979
Recall: 0.82906
F1 score: 0.84894
Accuracy: 0.85039
----------------------------------------------------------



| Parameter | Value |
|---|---|
| loss | 'log_loss' |
| penalty | 'l2' |
| alpha | 0.0001 |
| l1_ratio | 0.15 |
| fit_intercept | True |
| max_iter | 1000 |
| tol | 0.001 |
| shuffle | True |
| verbose | 0 |
| epsilon | 0.1 |

In [None]:
hyperparamter_optimization('dvd', mds_dataset)

Fitting 10 folds for each of 288 candidates, totalling 2880 fits
Resultados de optimización de hiperparámetros para la categoria dvd
Mejor score CV: 0.8470000000000001
Parámetros del mejor modelo:  {'alpha': 0.0001, 'eta0': 0.5, 'learning_rate': 'adaptive', 'max_iter': 1000, 'penalty': 'l2'}
Features mas importantes para la categoria dvd
[(5.640602420001845, 'great'), (4.238896239127745, 'best'), (3.4887409908210953, 'excellent'), (3.3618546445707183, 'love'), (3.081223935517021, 'his'), (2.859445643513621, 'the_best'), (2.7926679278944295, 'a_great'), (2.6698845411570695, 'well'), (2.553246090884828, 'wonderful'), (2.513620085175795, 'who')]
----------------------------------------------------------
--------- Resultados de clasificación para dvd con datos test ---------
----------------------------------------------------------
Precisión: 0.84230
Recall: 0.86608
F1 score: 0.85402
Accuracy: 0.85081
----------------------------------------------------------



| Parameter | Value |
|---|---|
| loss | 'log_loss' |
| penalty | 'l2' |
| alpha | 0.0001 |
| l1_ratio | 0.15 |
| fit_intercept | True |
| max_iter | 1000 |
| tol | 0.001 |
| shuffle | True |
| verbose | 0 |
| epsilon | 0.1 |

In [None]:
hyperparamter_optimization('electronics', mds_dataset)


Fitting 10 folds for each of 288 candidates, totalling 2880 fits
Resultados de optimización de hiperparámetros para la categoria electronics
Mejor score CV: 0.869
Parámetros del mejor modelo:  {'alpha': 0.0001, 'eta0': 0.001, 'learning_rate': 'optimal', 'max_iter': 1000, 'penalty': 'l2'}
Features mas importantes para la categoria electronics
[(7.824988422916189, 'great'), (4.7540155886918125, 'price'), (4.700183536536345, 'excellent'), (3.753747538582936, 'perfect'), (3.6191804538997814, 'good'), (3.5156781047193135, 'works'), (3.3615098736380338, 'best'), (2.9425579740324235, 'easy'), (2.881071790151266, 'the_best'), (2.5223486773243056, 'the_price')]
----------------------------------------------------------
--------- Resultados de clasificación para electronics con datos test ---------
----------------------------------------------------------
Precisión: 0.86058
Recall: 0.88799
F1 score: 0.87407
Accuracy: 0.87133
----------------------------------------------------------



| Parameter | Value |
|---|---|
| loss | 'log_loss' |
| penalty | 'l2' |
| alpha | 0.0001 |
| l1_ratio | 0.15 |
| fit_intercept | True |
| max_iter | 1000 |
| tol | 0.001 |
| shuffle | True |
| verbose | 0 |
| epsilon | 0.1 |

In [None]:
hyperparamter_optimization('kitchen', mds_dataset)

Fitting 10 folds for each of 288 candidates, totalling 2880 fits
Resultados de optimización de hiperparámetros para la categoria kitchen
Mejor score CV: 0.899
Parámetros del mejor modelo:  {'alpha': 0.0001, 'eta0': 0.001, 'learning_rate': 'optimal', 'max_iter': 1000, 'penalty': 'l2'}
Features mas importantes para la categoria kitchen
[(7.723351014298935, 'great'), (5.7782414167605785, 'easy'), (5.0179812665192935, 'love'), (4.413671233734823, 'easy_to'), (4.153159822041551, 'best'), (3.9163966140308846, 'excellent'), (3.572002215219192, 'perfect'), (3.2761222730420467, 'works'), (2.995779167757533, 'my'), (2.86405390552586, 'little')]
----------------------------------------------------------
--------- Resultados de clasificación para kitchen con datos test ---------
----------------------------------------------------------
Precisión: 0.89243
Recall: 0.88186
F1 score: 0.88711
Accuracy: 0.88848
----------------------------------------------------------



| Parameter | Value |
|---|---|
| loss | 'log_loss' |
| penalty | 'l2' |
| alpha | 0.0001 |
| l1_ratio | 0.15 |
| fit_intercept | True |
| max_iter | 1000 |
| tol | 0.001 |
| shuffle | True |
| verbose | 0 |
| epsilon | 0.1 |

### **Lexicon-Based Representation**

In [None]:
lr_model = SGDClassifier(loss='log_loss')
lexi_mds_dataset = load_mds_dataset(build_lexicon_features)

for domain in mds_dataset['domain'].unique().tolist():
  train_and_eval(lr_model, lexi_mds_dataset, domain, tfidf_pipe_list )

----------------------------------------------------------
--------- Resultados de clasificación para books ---------
----------------------------------------------------------
Precisión: 0.55059
Recall: 0.66343
F1 score: 0.60176
Accuracy: 0.55476
----------------------------------------------------------

----------------------------------------------------------
--------- Resultados de clasificación para dvd ---------
----------------------------------------------------------
Precisión: 0.50595
Recall: 0.98783
F1 score: 0.66917
Accuracy: 0.50781
----------------------------------------------------------

----------------------------------------------------------
--------- Resultados de clasificación para electronics ---------
----------------------------------------------------------
Precisión: 0.62780
Recall: 0.60868
F1 score: 0.61809
Accuracy: 0.62172
----------------------------------------------------------

----------------------------------------------------------
--------- Res

In [None]:
lr_model = SGDClassifier(loss='log_loss')
opinion_mds_dataset = load_mds_dataset(build_opinion_features)

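# NOTE: this call passes lexi_mds_dataset (SenticNet5 features), not the opinion_mds_dataset loaded above.
# To evaluate the opinion-lexicon features, pass opinion_mds_dataset instead.
for domain in mds_dataset['domain'].unique().tolist():
  train_and_eval(lr_model, lexi_mds_dataset, domain, tfidf_pipe_list )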

----------------------------------------------------------
--------- Resultados de clasificación para books ---------
----------------------------------------------------------
Precisión: 0.54317
Recall: 0.79196
F1 score: 0.64438
Accuracy: 0.55677
----------------------------------------------------------

----------------------------------------------------------
--------- Resultados de clasificación para dvd ---------
----------------------------------------------------------
Precisión: 0.59964
Recall: 0.37299
F1 score: 0.45991
Accuracy: 0.55856
----------------------------------------------------------

----------------------------------------------------------
--------- Resultados de clasificación para electronics ---------
----------------------------------------------------------
Precisión: 0.60672
Recall: 0.69548
F1 score: 0.64808
Accuracy: 0.62014
----------------------------------------------------------

----------------------------------------------------------
--------- Res