# Pipeline comprimido - Proyecto 1 Etapa 2 G32
El propósito de este archivo es crear el pipeline, con los fragmentos de código claves de la exploración en la etapa 1. De aquí se obtendrá el .joblib que se utilizará en el back.

## Actualizaciones, instalaciones e importaciones

In [8]:
import joblib
!python -m pip install --upgrade pip
!pip install --upgrade setuptools
!pip install -r ../requirements.txt

# Instalación de librerias
import nltk
import pandas as pd
import numpy as np
import sys
import re, string, unicodedata
import contractions
import inflect
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.base import BaseEstimator, ClassifierMixin
import matplotlib.pyplot as plt
import stanza
from copy import deepcopy
from math import floor

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\alvar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\alvar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\alvar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Lectura y preparación de datos

In [9]:
data=pd.read_csv('../data/raw/tipo2_entrenamiento_estudiantes.csv', sep=',', encoding = 'utf-8')
# Asignación a una nueva variable de los datos leidos
data_t=deepcopy(data)

In [10]:
def atypical_length(df, alfa, beta):
    """
    Elimina las reseñas con un bajo número de palabras dado que no son dicientes
    
    :param df: (DataFrame) reseñas y califiaciones de los usuarios
    :param alfa: (float) porcentaje en cola izquierda excluido
    :param alfa: (float) porcentaje en cola derecha excluido
    :return: (DataFrame) elimina las entradas con menos palabras que el umbral dado
    """
    
    df_longitudes = df.copy()
    df_longitudes['Conteo'] = [len(x.split()) for x in df_longitudes['Review']]
    
    upper = df_longitudes['Conteo'].quantile(1-beta)
    lower = df_longitudes['Conteo'].quantile(alfa)
    
    chosen = df_longitudes[df_longitudes['Conteo'] > lower]
    chosen = chosen[chosen['Conteo'] < upper]
    
    #print("Mínimo número palabras: ", min(chosen['Conteo']))
    #print("Máximo número palabras: ", max(chosen['Conteo']))
    
    chosen = chosen.drop(columns=['Conteo'])
    
    return chosen


def remove_duplicates(df):
    """
    Elimina las reseñas duplicadas en la columna Review
    
    :param df: dataframe a manipular
    :return: 
    """
    return df.drop_duplicates()


# Entrenamientos

In [11]:
def load_processed_data(file_name, frac=1):
    """
    Carga el archivo de datos procesados y los
    procesa para el entrenamiento
    
    :param file_name: nombre del archivo a cargar. Debe ser .csv, pero no se pasa la extensión
    :param frac: tamaño del dataset que se carga como porcentaje entre 0.00 y 1.00
    :return: train set y test set separados en features y objetivo
    """
    
    data_p=pd.read_csv('../data/processed/' + file_name + '.csv', sep=';', encoding = 'utf-8')
    # Asignación a una nueva variable de los datos leidos
    data_p = data_p.sample(n=floor(frac*data_p.shape[0]))
    print("Número de datos encontrados:", data_p.shape)
    
    return data_p
    
    
def separate_data(data_p, test_size):
    """
    Separa el dataframe en train y test sets
    
    :param data: dataframe cargado
    :param test_size: tamaño de los tests sets en comparación con los datos totales
    :return: xtr, xts, ytr, yts
    """
    x = data_p.iloc[:,:-1]
    y = data_p.iloc[:,-1]
    
    xtr, xts, ytr, yts = train_test_split(x, y, test_size=test_size, random_state=42069)
    
    print("X train set:", xtr.shape)
    print("X test  set:", xts.shape)
    print("Y train set:", ytr.shape)
    print("Y test  set:", yts.shape)
    
    return xtr, xts, ytr, yts


def custom_sampling(df, objective_distribution):
    """
    Hace un muestreo personalizado de las cinco clases. 
    Las que estén subrepresentadas se sobremuestruean y las
    que estén sobre representadas se submuestrean
    
    :param df: dataframe de datos procesados
    :return: dataframe con distribución uniforme
    """
    total_count = df.shape[0]
    
    for cls, per in objective_distribution.items():
        class_df = df[df['Class'] == cls]
        desired_count = int(total_count * per)
        
        resample_df = class_df.sample(desired_count, replace=True)
        # Combine with the existing DataFrame
        df = pd.concat([df[df['Class'] != cls], resample_df], ignore_index=True)
            
    return df


def uniform_sampling(df):
    """
    Hace un muestreo uniforme de las cinco clases. 
    Las que estén subrepresentadas se sobremuestruean y las
    que estén sobrerepresentadas se submuestrean
    :param df: dataframe de datos procesados
    :return: dataframe con distribución uniforme
    """
    # Undersampled classes and oversampled classes
    
    # Sample under-represented classes to reach a desired count (adjust count as needed)
    desired_count = df.shape[0]//5  # Match the count of correctly sampled class
    
    objective_values = set(df['Class'].to_list())
    objective_distribution = dict
    
    for v in objective_values:
        objective_distribution[v] = 1//len(objective_values)
    
    return custom_sampling(df, objective_distribution)

In [12]:
def train_nb(xtr, xts, ytr, yts, nbchoice):
    """
    Entrena un Bayes Ingenuo dado
    
    :param xtr: features de entrenamiento
    :param xts: features de test
    :param ytr: var. objetivo de entrenamiento
    :param yts: var. objetivo de test
    :param nbchoice: modelo de Bayes Ingenuo elegido
    :return: el modelo Bayes Ingenuo entrenado
    """
    nb = nbchoice()
    
    # entrena el modelo
    nb.fit(xtr, ytr)
    
    # predice para el test set
    ypred = nb.predict(xts)
    
    print(classification_report(yts, ypred, target_names=['1','2','3','4','5']))
    
    return nb

In [13]:
objective_distribution = {
        1 : 0.170,
        2 : 0.170,
        3 : 0.235,
        4 : 0.240,
        5 : 0.205
    }
# df = custom_sampling(dt_vectorized, objective_distribution)
# xtr, xts, ytr, yts = separate_data(df, 0.20)
# mnb = train_nb(xtr, xts, ytr, yts, MultinomialNB)

In [14]:
# gnb = train_nb(xtr, xts, ytr, yts, GaussianNB)

In [15]:
from sklearn.metrics import ConfusionMatrixDisplay

# 
# def trainSVM(xtr, xts, ytr, yts):
#     """
#     Entrena un modelo con SVM y lo prueba, sin usar pipeline.
# 
#     :param xtr: features de entrenamiento
#     :param xts: features de test
#     :param ytr: var. objetivo de entrenamiento
#     :param yts: var. objetivo de test
#     :return: el modelo de SVM entrenado
#     """
# 
#     svm = SVC(random_state=0)
#     param_grid = {
#         'C': [0.1, 1, 10, 100],  # Parámetro de regularización
#         'kernel': ['linear', 'rbf'],  # Linear y RBF son los kernels más comunes
#     }
#     grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy')
#     grid_search.fit(xtr, ytr)
#     best_params = grid_search.best_params_
#     print("Mejores parámetros:", best_params)
#     # Crear y entrenar el modelo con los mejores parámetros encontrados
#     best_model = SVC(**best_params, random_state=0)
#     best_model.fit(xtr, ytr)
# 
#     # Predecir con el conjunto de test escalado
#     ypred = best_model.predict(xts)
#     print(classification_report(yts, ypred, target_names=['1', '2', '3', '4', '5']))
#     cm = confusion_matrix(yts, ypred, labels=[1, 2, 3, 4, 5])
#     disp = ConfusionMatrixDisplay(cm, display_labels=[1,2,3,4,5])
#     
#     # Plot the confusion matrix with colors
#     fig, ax = plt.subplots(figsize=(8, 8))
#     disp.plot(cmap=plt.cm.Blues, ax=ax)
#     
#     # Add a colorbar
#     plt.colorbar(disp.im_, ax=ax)
#     
#     # Add title and labels
#     ax.set_title('Confusion Matrix with Colors')
#     ax.set_xlabel('Predicted Labels')
#     ax.set_ylabel('True Labels')
#     
#     # Show the plot
#     plt.show()
#     
#     return best_model


In [16]:
# objective_distribution = {
#         1 : 0.201,
#         2 : 0.201,
#         3 : 0.180,
#         4 : 0.194,
#         5 : 0.190
#     }
# 
# df = custom_sampling(dt_vectorized, objective_distribution)
# xtr, xts, ytr, yts = separate_data(df, 0.20)
# svm_model = trainSVM(xtr, xts, ytr, yts)

# Forma Pipeline

In [17]:
from sklearn.base import TransformerMixin


class Preprocessing(BaseEstimator, TransformerMixin):
    
    def __init__(self, vocabulary=None):
        if vocabulary is not None:
            print("Vocabulary given")
            self.feature_vectorizer_algorithm = TfidfVectorizer (
                decode_error='ignore',
                strip_accents='ascii',
                analyzer='word',
                vocabulary=vocabulary
            )
        else:
            print("Generating new vocabulary")
            self.feature_vectorizer_algorithm = TfidfVectorizer (
                decode_error='ignore',
                strip_accents='ascii',
                analyzer='word',
                max_features=10000
            )
            
        print("Inicializando preprocessing...")
        # algoritmo de vectorización de features
        
        # features
        self.feature_names = None
        # stanza pipeline
        self.stanza_pipeline = stanza.Pipeline(lang="es", processors="tokenize,mwt,pos,lemma")
        
        
    def fit(self, X, y=None):
        return self
    
    
    def stanza_preprocessing(self, words):
        """
        Uses the Stanza Pipeline to preprocess text
        Recommended before cleaning as stopword eliminations, lower-casing
        and punctuation removal affect POS and word tagging for lemma resolution
        
        :param words: (list) unclean words passed through the pipeline
        :param pipe: (stanza.Pipeline) stanza Pipeline used
        :return: (list) lemmas obtained
        """
        doc = self.stanza_pipeline(words)
        lemmas = [w.lemma for w in doc.sentences[0].words]
        return lemmas
        
    
    def remove_non_ascii(self, words):
        """
        Remueve caractéres no ASCII de la lista de palabras tokenizadas
        
        :param words: (list) lista de palabras tokenizadas
        :returns: (list) lista de palabras/strings 'ASCII-zadas'
        """
        # TODO: reconsiderar entre ASCII y UTF-8
        new_words = []
        for word in words:
            if word is not None:
                # debe ser codificado en UTF-8 para no obstaculizar al lematizador
              new_word = unicodedata.normalize('NFKD', word).encode('utf-8', 'ignore').decode('utf-8', 'ignore')
              new_words.append(new_word)
        return new_words
    
    
    def to_lowercase(self, words):
        """
        Convierte todos los caracteres a minúscula de la lista de palabras tokenizadas
        
        :param words: (list) lista de palabras tokenizadas
        :returns: (list) lista de palabras en minúscula
        """
        return [w.lower() for w in words]
    
    
    def remove_punctuation(self, words):
        """
        Remove punctuation from list of tokenized words
        
        :param words: (list) lista de palabas
        :returns: (list) lista de palabras con puntuación removida
        """
        new_words = []
        for word in words:
            if word is not None:
                new_word = re.sub(r'[^\w\s]', '', word)
                if new_word != '':
                    new_words.append(new_word)
        return new_words
    
    
    def replace_numbers(self, words):
       """Replace all interger occurrences in list of tokenized words with textual representation"""
       p = inflect.engine()
       print(words)
       new_words = []
       for word in words:
           if word.isdigit():
               new_word = p.number_to_words(word)
               new_words.append(new_word)
               print("if " + new_word)
           else:
               new_words.append(word)
       return new_words
    
    
    def remove_stopwords(self, words):
        """
        Remueve las stop words de la lista de palabras tokenizadas
        
        :param words: (list) lista de palabras tokenizadas
        :returns: (list) lista de palabras sin stop words
        """
        
        languages = ['spanish']
        stopword = nltk.corpus.stopwords.words(languages)
        
        # in case of only spanish, contractions are included as stopwords
        return [w for w in words if w not in stopword]
        
        
    def separate_contractions(self, words):
        """
        Elimina las contracciones. Cubre inglés, las contracciones en español están cubiertas
        con las stopwords
        
        :param words: lista de palabras tokenizadas
        :return: lista de palabras sin contracciones
        """
        return words.apply(contractions.fix)
    
    
    # se renombra el preprocessing como cleaning
    def cleaning(self, words):
        words = self.to_lowercase(words)
    #    words = self.replace_numbers(words)
        words = self.remove_punctuation(words)
        words = self.remove_non_ascii(words)
        words = self.remove_stopwords(words)
    #    words = self.separate_contractions(words)
        return words
    
    
    def vectorize(self, X):
        """
        Vectoriza las features utilizando el algoritmo provisto
        
        :param data: data frame de datos procesados
        :param feature_vectorizer_algorithm: algoritmo de vectorización a utilizar para features
        :return: (lista de números correspondientes a la vectorización, arreglo de features)
        """
        X['Review'] = X['Review'].apply(lambda x: ' '.join(map(str, x)))
        
        x_data = X['Review']
        
        x_data_vectorized_matrix = self.feature_vectorizer_algorithm.fit_transform(x_data)
        x_data_vectorized_df = pd.DataFrame(x_data_vectorized_matrix.toarray())  # ... for additional features from csr_matrix
        
        # obtiene el arreglo de palabras con columnas
        self.feature_names = self.feature_vectorizer_algorithm.get_feature_names_out()
        
        res = pd.concat([x_data_vectorized_df], axis=1)
        
        return res

    
    def transform(self, X):        
        # tokenización + lematización Stanza
        print("Stanza preprocessing... ", end="")
        X['Review'] = X['Review'].apply(self.stanza_preprocessing)
        print("OK", X.shape)
        
        # limpieza
        print("Additional cleaning... ", end="")
        X['Review'] = X['Review'].apply(self.cleaning)
        print("OK", X.shape)
        
        # vectorización de features
        print("Feature and objective vectorization... ", end="")
        # se guarda de esta manera para no afectar el pipeline. Se podría analizar una manera más ortodoxa de hacer
        X = self.vectorize(X)
        
        # pueden quedar columnas libres por eliminación de stopwords. Se llenan con 0s
        X = X.fillna(0)
        # valores en clase se vuelven floats, toca retornar a int
        # X['Class'] = X['Class'].apply(lambda x: int(x) if pd.notnull(x) else x)
        # # valores de 0 en clase son erróneos
        # X = X[X['Class'] != 0]
        
        print("OK", X.shape)
        
        return X
        

In [18]:
class Process(BaseEstimator, TransformerMixin):
    def __init__(self):        
        # features
        self.feature_names = None
        # stanza pipeline
        self.stanza_pipeline = stanza.Pipeline(lang="es", processors="tokenize,mwt,pos,lemma")
        
        
    def fit(self, X, y=None):
        return self
    
    
    def stanza_preprocessing(self, words):
        """
        Uses the Stanza Pipeline to preprocess text
        Recommended before cleaning as stopword eliminations, lower-casing
        and punctuation removal affect POS and word tagging for lemma resolution
        
        :param words: (list) unclean words passed through the pipeline
        :param pipe: (stanza.Pipeline) stanza Pipeline used
        :return: (list) lemmas obtained
        """
        doc = self.stanza_pipeline(words)
        lemmas = [w.lemma for w in doc.sentences[0].words]
        return lemmas
        
    
    def remove_non_ascii(self, words):
        """
        Remueve caractéres no ASCII de la lista de palabras tokenizadas
        
        :param words: (list) lista de palabras tokenizadas
        :returns: (list) lista de palabras/strings 'ASCII-zadas'
        """
        # TODO: reconsiderar entre ASCII y UTF-8
        new_words = []
        for word in words:
            if word is not None:
                # debe ser codificado en UTF-8 para no obstaculizar al lematizador
              new_word = unicodedata.normalize('NFKD', word).encode('utf-8', 'ignore').decode('utf-8', 'ignore')
              new_words.append(new_word)
        return new_words
    
    
    def to_lowercase(self, words):
        """
        Convierte todos los caracteres a minúscula de la lista de palabras tokenizadas
        
        :param words: (list) lista de palabras tokenizadas
        :returns: (list) lista de palabras en minúscula
        """
        return [w.lower() for w in words]
    
    
    def remove_punctuation(self, words):
        """
        Remove punctuation from list of tokenized words
        
        :param words: (list) lista de palabas
        :returns: (list) lista de palabras con puntuación removida
        """
        new_words = []
        for word in words:
            if word is not None:
                new_word = re.sub(r'[^\w\s]', '', word)
                if new_word != '':
                    new_words.append(new_word)
        return new_words
    
    
    def replace_numbers(self, words):
       """Replace all interger occurrences in list of tokenized words with textual representation"""
       p = inflect.engine()
       print(words)
       new_words = []
       for word in words:
           if word.isdigit():
               new_word = p.number_to_words(word)
               new_words.append(new_word)
               print("if " + new_word)
           else:
               new_words.append(word)
       return new_words
    
    
    def remove_stopwords(self, words):
        """
        Remueve las stop words de la lista de palabras tokenizadas
        
        :param words: (list) lista de palabras tokenizadas
        :returns: (list) lista de palabras sin stop words
        """
        
        languages = ['spanish']
        stopword = nltk.corpus.stopwords.words(languages)
        
        # in case of only spanish, contractions are included as stopwords
        return [w for w in words if w not in stopword]
        
        
    def separate_contractions(self, words):
        """
        Elimina las contracciones. Cubre inglés, las contracciones en español están cubiertas
        con las stopwords
        
        :param words: lista de palabras tokenizadas
        :return: lista de palabras sin contracciones
        """
        return words.apply(contractions.fix)
    
    
    # se renombra el preprocessing como cleaning
    def cleaning(self, words):
        words = self.to_lowercase(words)
    #    words = self.replace_numbers(words)
        words = self.remove_punctuation(words)
        words = self.remove_non_ascii(words)
        words = self.remove_stopwords(words)
    #    words = self.separate_contractions(words)
        return words

    
    def transform(self, X):        
        # tokenización + lematización Stanza
        print("Stanza preprocessing... ", end="")
        X['Review'] = X['Review'].apply(self.stanza_preprocessing)
        print("OK", X.shape)
        
        # limpieza
        print("Additional cleaning... ", end="")
        X['Review'] = X['Review'].apply(self.cleaning)
        
        print("OK", X.shape)
        
        return X
    

In [19]:
def training_preprocessing(data):
    alfa = 0.05
    beta = 0.05
    
    df = deepcopy(data)
    
    # eliminación de nulos
    print("Null elimination... ", end="")
    df = df.dropna()
    print("OK", df.shape)
    
    # eliminación de duplicados
    print("Duplicate elimination... ", end="")
    df = remove_duplicates(df)
    print("OK", df.shape)
    
    # eliminación de longitudes atípicas
    print("Ayptical length elimination... ", end="")
    df = atypical_length(df, alfa, beta)
    print("OK", df.shape)
    
    return df

In [20]:
class SVM(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        """
        Entrena un modelo con SVM y lo prueba, sin usar pipeline.
    
        :param xtr: features de entrenamiento
        :param xts: features de test
        :param ytr: var. objetivo de entrenamiento
        :param yts: var. objetivo de test
        :return: el modelo de SVM entrenado
        """
    
        self.svm = SVC(random_state=0)
        self.param_grid = {
            'C': [0.1, 1, 10, 100],  # Parámetro de regularización
            'kernel': ['linear', 'rbf'],  # Linear y RBF son los kernels más comunes
        }
        self.cv = 5
        self.scoring = 'accuracy'
        
    # def fit(self, X):
    #     grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy')
    #     grid_search.fit(xtr, ytr)
    #     best_params = grid_search.best_params_
    #     print("Mejores parámetros:", best_params)
    #     # Crear y entrenar el modelo con los mejores parámetros encontrados
    #     best_model = SVC(**best_params, random_state=0)
    #     best_model.fit(xtr, ytr)
    # 
    #     # Predecir con el conjunto de test escalado
    #     ypred = best_model.predict(xts)
    #     print(classification_report(yts, ypred, target_names=['1', '2', '3', '4', '5']))
    #     cm = confusion_matrix(yts, ypred, labels=[1, 2, 3, 4, 5])
    #     disp = ConfusionMatrixDisplay(cm, display_labels=[1,2,3,4,5])
    #     
    #     # Plot the confusion matrix with colors
    #     fig, ax = plt.subplots(figsize=(8, 8))
    #     disp.plot(cmap=plt.cm.Blues, ax=ax)
    #     
    #     # Add a colorbar
    #     plt.colorbar(disp.im_, ax=ax)
    #     
    #     # Add title and labels
    #     ax.set_title('Confusion Matrix with Colors')
    #     ax.set_xlabel('Predicted Labels')
    #     ax.set_ylabel('True Labels')
    #     
    #     # Show the plot
    #     plt.show()
    #     
    #     return best_model
        

In [21]:
from joblib import parallel_backend

svc_param_grid={
    'C': [0.1, 1, 10, 100],  # Parámetro de regularización
    'kernel': ['linear', 'rbf'],  # Linear y RBF son los kernels más comunes
}

pipeline = Pipeline(steps = [
    ('preprocessing', Preprocessing()),
    ('classifier', GridSearchCV(SVC(), param_grid=svc_param_grid, cv=5, scoring='accuracy', n_jobs=None, verbose=2))
])

2024-04-19 19:45:15 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Generating new vocabulary
Inicializando preprocessing...


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

2024-04-19 19:45:16 INFO: Downloaded file to C:\Users\alvar\stanza_resources\resources.json
2024-04-19 19:45:16 INFO: Loading these models for language: es (Spanish):
| Processor | Package         |
-------------------------------
| tokenize  | ancora          |
| mwt       | ancora          |
| pos       | ancora_charlm   |
| lemma     | ancora_nocharlm |

2024-04-19 19:45:16 INFO: Using device: cpu
2024-04-19 19:45:16 INFO: Loading: tokenize
2024-04-19 19:45:18 INFO: Loading: mwt
2024-04-19 19:45:18 INFO: Loading: pos
2024-04-19 19:45:19 INFO: Loading: lemma
2024-04-19 19:45:19 INFO: Done loading processors!


In [22]:
df = pd.read_csv('../data/raw/tipo2_entrenamiento_estudiantes.csv', sep=',', encoding = 'utf-8')
df = df.drop_duplicates()
df.shape

(7802, 2)

In [23]:
df = pd.read_csv('../data/raw/tipo2_entrenamiento_estudiantes.csv', sep=',', encoding = 'utf-8')

objective_distribution = {
        1 : 0.201,
        2 : 0.201,
        3 : 0.180,
        4 : 0.194,
        5 : 0.190
}
# aún funciona, pero no sabemos si deberíamos usar todo el dataset
#df_sampled = custom_sampling(df, objective_distribution)
df_sampled = df.sample(frac=0.05)
df_pp = training_preprocessing(df_sampled)

xtr, xts, ytr, yts = separate_data(df_pp, 0.20)


Null elimination... OK (394, 2)
Duplicate elimination... OK (394, 2)
Ayptical length elimination... OK (353, 2)
X train set: (282, 1)
X test  set: (71, 1)
Y train set: (282,)
Y test  set: (71,)


In [24]:
pp = Preprocessing(vocabulary=['hola'])

2024-04-19 19:45:19 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Vocabulary given
Inicializando preprocessing...


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

2024-04-19 19:45:19 INFO: Downloaded file to C:\Users\alvar\stanza_resources\resources.json
2024-04-19 19:45:20 INFO: Loading these models for language: es (Spanish):
| Processor | Package         |
-------------------------------
| tokenize  | ancora          |
| mwt       | ancora          |
| pos       | ancora_charlm   |
| lemma     | ancora_nocharlm |

2024-04-19 19:45:20 INFO: Using device: cpu
2024-04-19 19:45:20 INFO: Loading: tokenize
2024-04-19 19:45:20 INFO: Loading: mwt
2024-04-19 19:45:20 INFO: Loading: pos
2024-04-19 19:45:20 INFO: Loading: lemma
2024-04-19 19:45:20 INFO: Done loading processors!


In [25]:
print(xtr.shape)
print(ytr.shape)
print(type(xtr))
print(type(ytr))

(282, 1)
(282,)
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


In [26]:
# pipeline.fit(xtr, ytr)

In [27]:
# print(xts.shape)
# print(yts.shape)

In [28]:
# pipeline.score(xts, yts)

In [29]:
# features = pipeline[0].feature_vectorizer_algorithm.get_feature_names_out()

In [30]:
# best_params = pipeline[1].best_params_
# best_params

In [31]:
# calibrated_pipeline = Pipeline(steps = [
#     ('preprocessing', Preprocessing(features)),
#     ('classifier', SVC(**best_params))
# ])

In [32]:
# pred = pd.DataFrame({'Review': ['bonito']})
# pred

In [33]:
# # aún funciona, pero no sabemos si deberíamos usar todo el dataset
# #df_sampled = custom_sampling(df, objective_distribution)
# df_sampled = df.sample(frac=0.05)
# df_pp = training_preprocessing(df_sampled)
# # 
# xtr, xts, ytr, yts = separate_data(df_pp, 0.20)
# calibrated_pipeline.fit(xtr, xts)

In [34]:
# calibrated_pipeline.predict(pred)

In [35]:
# pipeline[0].feature_vectorizer_algorithm.get_feature_names_out()

In [36]:
# pipeline[0].feature_vectorizer_algorithm.vocabulary_

In [37]:
svc_param_grid={
    'C': [0.1, 1, 10, 100],  # Parámetro de regularización
    'kernel': ['linear', 'rbf'],  # Linear y RBF son los kernels más comunes
}

train_pipe = Pipeline(steps = [
    ('lemmatize', Process()),
    ('vectorize', TfidfVectorizer(
        decode_error='ignore',
        strip_accents='ascii',
        analyzer='word',
        max_features=10000
    )),
    ('classifier', GridSearchCV(SVC(), param_grid=svc_param_grid, cv=5, scoring='accuracy', n_jobs=None, verbose=2))
])

2024-04-19 19:45:21 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

2024-04-19 19:45:21 INFO: Downloaded file to C:\Users\alvar\stanza_resources\resources.json
2024-04-19 19:45:21 INFO: Loading these models for language: es (Spanish):
| Processor | Package         |
-------------------------------
| tokenize  | ancora          |
| mwt       | ancora          |
| pos       | ancora_charlm   |
| lemma     | ancora_nocharlm |

2024-04-19 19:45:21 INFO: Using device: cpu
2024-04-19 19:45:21 INFO: Loading: tokenize
2024-04-19 19:45:21 INFO: Loading: mwt
2024-04-19 19:45:21 INFO: Loading: pos
2024-04-19 19:45:22 INFO: Loading: lemma
2024-04-19 19:45:22 INFO: Done loading processors!


In [38]:
df = pd.read_csv('../data/raw/tipo2_entrenamiento_estudiantes.csv', sep=',', encoding = 'utf-8')
df_sampled = df.sample(frac=0.05)
df_pp = training_preprocessing(df_sampled)
xtr, xts, ytr, yts = separate_data(df_pp, 0.20)

Null elimination... OK (394, 2)
Duplicate elimination... OK (394, 2)
Ayptical length elimination... OK (348, 2)
X train set: (278, 1)
X test  set: (70, 1)
Y train set: (278,)
Y test  set: (70,)


In [39]:
type(xtr)

pandas.core.frame.DataFrame

In [40]:
res = train_pipe[0].transform(xtr)

Stanza preprocessing... OK (278, 1)
Additional cleaning... OK (278, 1)


In [41]:
res[:10]

Unnamed: 0,Review
801,"[5, principal, plaza, habana, vieja]"
4050,"[excelente, servicio, francisco, garcia, rapid..."
2762,"[ser, primero, vez, hospedir, hotel, ibi, pens..."
6720,"[buen, tarde, alojar, hotel, hacer, 6, año, e..."
4365,"[deteriorar, zona]"
3707,"[excelente, servicio, comida, exquisito, super..."
2087,"[recomendar, lugar, siempre, lleno, él, hacer..."
5088,"[ser, corazon, caminito, sumergido, cultura, a..."
6680,"[más, 100, bar, caribe, cuba, bahamas, miami,..."
7854,[defraudar]


In [42]:
res.shape

(278, 1)

In [63]:
from copy import copy

data_stringfied = copy(res)
data_stringfied['Review'] = data_stringfied['Review'].apply(lambda x: ' '.join(map(str, x)))
data_stringfied['Review'][801]

'5 principal plaza habana vieja'

In [44]:
train_pipe[1]

In [66]:
res2 = train_pipe[1].fit_transform(data_stringfied['Review'])
res2

<278x1229 sparse matrix of type '<class 'numpy.float64'>'
	with 3126 stored elements in Compressed Sparse Row format>

In [67]:
train_pipe[1].get_feature_names_out()

array(['10', '100', '11', ..., 'yumka', 'zapato', 'zona'], dtype=object)

In [68]:
res2_df = pd.DataFrame(res2.toarray())
res2_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1219,1220,1221,1222,1223,1224,1225,1226,1227,1228
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.580497
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
273,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
274,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
275,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
276,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000


In [69]:
train_pipe[2]

In [70]:
res3 = train_pipe[2].fit(res2, ytr)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
[CV] END ...............................C=0.1, kernel=linear; total time=   0.0s
[CV] END ...............................C=0.1, kernel=linear; total time=   0.0s
[CV] END ...............................C=0.1, kernel=linear; total time=   0.0s
[CV] END ...............................C=0.1, kernel=linear; total time=   0.0s
[CV] END ...............................C=0.1, kernel=linear; total time=   0.0s
[CV] END ..................................C=0.1, kernel=rbf; total time=   0.0s
[CV] END ..................................C=0.1, kernel=rbf; total time=   0.0s
[CV] END ..................................C=0.1, kernel=rbf; total time=   0.0s
[CV] END ..................................C=0.1, kernel=rbf; total time=   0.0s
[CV] END ..................................C=0.1, kernel=rbf; total time=   0.0s
[CV] END .................................C=1, kernel=linear; total time=   0.0s
[CV] END .................................C=1, ke

In [74]:
svc = SVC(**res3.best_params_)
svc

In [84]:
svc.fit(res2, ytr)
xtr

Unnamed: 0,Review
801,"[5, principal, plaza, habana, vieja]"
4050,"[excelente, servicio, francisco, garcia, rapid..."
2762,"[ser, primero, vez, hospedir, hotel, ibi, pens..."
6720,"[buen, tarde, alojar, hotel, hacer, 6, año, e..."
4365,"[deteriorar, zona]"
...,...
6005,"[hotel, ser, maravilla, atención, primero, lu..."
1879,"[visitar, dos, vez, catedral, sal, regresar, h..."
1597,"[comida, debajo, espectativa, relación, calid..."
6261,"[ser, castillo, bonito, recorrer, tú, ir, sug..."


In [86]:
r1 = train_pipe[0].transform(xts)
r1

Stanza preprocessing... 

ValueError: If neither 'pretokenized' or 'no_ssplit' option is enabled, the input to the TokenizerProcessor must be a string or a Document object.  Got <class 'list'>