### Universidad Nacional de Córdoba - Facultad de Matemática, Astronomía, Física y Computación

### Diplomatura en Ciencia de Datos, Aprendizaje Automático y sus Aplicaciones 2021
Search and Recommendation for Legal Texts

Mentor: Jorge E. Pérez Villella

# Supervised Learning Assignment

Team members:

- Correa Francisco
- Oviedo Christian


The goal of this assignment is to consolidate what we have learned so far by re-analyzing the data from different perspectives (feature selection, redefinition of classes and subclasses) in order to obtain new results with the models already studied, adding ensemble learning to the analysis.

The idea is to learn to iterate in the data science process rather than settling for the results of the first pass.

We also take a closer look at stop words and how to build a custom stop word list.

To solve the classification problem, this assignment proposes training the following scikit-learn models: LogisticRegression and SGDClassifier.


Due date: September 12, 2021

# Stop words

When working on assignment 2, *Práctico Análisis y Visualización* (Analysis and Visualization), we applied several techniques to generate stop words. One technique we did not apply, and which we believe can be combined with those already used, is to treat as stop words the words with a **low** IDF. What counts as **low** is subjective. In our case, if the F1-score falls below 70% after applying the new stop words, we consider that removing those stop words is making the classification worse.

## Identifying words with a "low" IDF

A low IDF means the word appears in many documents and therefore carries little relevant information for classification.
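As a rough illustration of why frequent words get a low IDF, the sketch below computes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, which is the formula scikit-learn documents for `TfidfVectorizer` with its default `smooth_idf=True`. The function name `smoothed_idf` and the three-document toy corpus are assumptions made for this example; they are not part of the notebook.

```python
import math

def smoothed_idf(documents):
    """Return {term: idf} using idf = ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # Count each term once per document (document frequency).
        for term in set(doc.split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }

# Hypothetical toy corpus: "juzgado" occurs in every document.
toy_corpus = [
    "juzgado familia denuncia",
    "juzgado laboral despido",
    "juzgado penal condena",
]
idf = smoothed_idf(toy_corpus)

# A term present in all documents gets the minimum value, 1.0;
# a term present in a single document gets a higher value.
print(idf["juzgado"], idf["despido"])
```

A word such as "juzgado", present in every document, ends up with the lowest possible weight, which is what makes it a stop word candidate.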

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import train_test_split

We load the corpus curated in the previous assignments. The words identified as stop words in those assignments have already been removed from it.

In [2]:
corpus_file_name = 'cleaned_corpus.csv'

#corpus_file_name = 'cleaned_corpus_4.csv'



cleaned_corpus_tmp = pd.read_csv(corpus_file_name)


cleaned_corpus = cleaned_corpus_tmp.drop(cleaned_corpus_tmp.columns[0], axis=1)

X = cleaned_corpus['text']  
y = cleaned_corpus['classifier']



In [3]:
cleaned_corpus.head()

Unnamed: 0,text,id,classifier
0,dato causa sede ciudad cordoba dependencia juz...,4de122c24ab1606c9d67f4ff9e656143,Documentos/MENORES
1,univoco fecha materia revista familia tribunal...,1f9cdcb2c2596656b540c1271fc2d843,Documentos/MENORES
2,juzgado juventud violencia familiar 8ª cordoba...,17dcae14592fc6e87680ccb4251d9395,Documentos/MENORES
3,auto caratulado a. a. denuncia violencia gener...,4b3ae58648b6267ebb332feec8002588,Documentos/MENORES
4,juzg adolescencia violencia familiar 4ta cba s...,1316026beaa1d7e6530bdfe7e54f7b5c,Documentos/MENORES


We vectorize the text with *TfidfVectorizer*.

In [4]:
vectorizer = TfidfVectorizer()

X_train = X

vectorizer.fit(X_train)

TfidfVectorizer()

We create a data frame with the TF-IDF result so that the data is easier to query.

In [5]:
def create_idf_data_frame(vectorizer):
    
    df_idf = pd.DataFrame(data = vectorizer.idf_ , columns= ["idf_weight"])
    df_idf['word'] = vectorizer.get_feature_names()

    
    return df_idf


df_idf = create_idf_data_frame(vectorizer)

sorted_df_idf = df_idf.sort_values(by=['idf_weight'])


sorted_df_idf.shape


(17964, 2)

In [6]:
print (f"Cantidad total de terminos {sorted_df_idf.shape[0]}")

Cantidad total de terminos 17964


We compute the 0.025, 0.05, 0.075 and 0.1 quantiles of the IDF values.

In [7]:
percent_df = sorted_df_idf.quantile([ .025,.05, .075, .1], axis = 0)
percent_df

Unnamed: 0,idf_weight
0.025,1.964569
0.05,2.471817
0.075,2.833607
0.1,3.164964


We generate the stop word list. Note that this is done empirically: we vary the size of the list and train models. That is, besides trying different hyperparameters, we also try different sets of stop words to remove from the corpus. We then compare the results and look for the combination of hyperparameters and stop words that performs best.

To reproduce the values shown in the *Results* section, change the value of

limit = percent_df.loc[.1].values[0]

For example, to generate a stop word list containing the 2.5% of words with the lowest IDF, set

limit = percent_df.loc[.025].values[0]
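The selection step can be sketched in plain Python as follows. This is a simplified illustration, not the notebook's code: the helper `low_idf_stop_words` and the IDF values are hypothetical, and the cut-off uses a nearest-rank quantile, whereas pandas' `quantile` interpolates linearly by default, so the thresholds may differ slightly.

```python
import math

def low_idf_stop_words(idf_by_word, fraction):
    """Return (stop_words, limit): words whose IDF is <= the nearest-rank quantile."""
    ranked = sorted(idf_by_word.values())
    # Nearest-rank cut-off (an assumption; pandas interpolates linearly instead).
    rank = max(1, math.ceil(fraction * len(ranked)))
    limit = ranked[rank - 1]
    stop_words = sorted(w for w, v in idf_by_word.items() if v <= limit)
    return stop_words, limit

# Hypothetical IDF weights for ten terms.
toy_idf = {
    "juzgado": 1.0, "familia": 1.7, "laboral": 2.1, "penal": 2.1,
    "despido": 3.2, "condena": 3.2, "denuncia": 3.5, "menor": 3.9,
    "tutela": 4.3, "zuliani": 5.8,
}

stop_10, limit_10 = low_idf_stop_words(toy_idf, 0.1)
stop_50, limit_50 = low_idf_stop_words(toy_idf, 0.5)
print(stop_10, limit_10)
print(stop_50, limit_50)
```

Because the filter keeps every word at or below the threshold, ties at the cut-off can make the list slightly longer than the requested fraction.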

In [8]:
#limit = percent_df.loc[.025].values[0]
#limit = percent_df.loc[.05].values[0]
#limit = percent_df.loc[.075].values[0]
#limit = percent_df.loc[.1].values[0]




#stop_words = df_idf[df_idf['idf_weight'] <= limit ]['word'].values.tolist()
stop_words = []

In [9]:
print (f"Cantidad de stop words {len(stop_words)}")

Cantidad de stop words 0


In [10]:
vectorizer = TfidfVectorizer(stop_words = stop_words )

X_train = cleaned_corpus['text']

vectorizer.fit(X_train)

TfidfVectorizer(stop_words=[])

In [11]:
sorted_df_idf = create_idf_data_frame(vectorizer)

In [12]:
sorted_df_idf

Unnamed: 0,idf_weight,word
0,4.299944,00
1,5.804021,0032
2,5.804021,003553549928
3,5.804021,00412415900
4,5.804021,00hs
...,...,...
17959,3.239072,zona
17960,5.110874,zonal
17961,5.804021,zuliani
17962,5.398556,zunino


# Training the models


We train the following models:

- LogisticRegression
- SGDClassifier
- RandomForestClassifier
- MultinomialNB
- SVC

We fix the seed so that the experiments are repeatable.

In [13]:
seed = 42

We split the data set into a training set and a test set.

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=seed)

In [15]:
def get_vectors(X_train, X_test, vectorizer):
    
    X_train_vect = vectorizer.fit_transform(X_train)
    X_text_vect = vectorizer.transform(X_test)
    
    return (X_train_vect, X_text_vect)

In [16]:
vectorizer = TfidfVectorizer(stop_words = stop_words)
X_train_vect , X_test_vect = get_vectors(X_train, X_test, vectorizer)

X_train_vect.shape

(162, 15696)

In [17]:
X_test_vect.shape

(81, 15696)

We implement a grid search with cross-validation that accepts different scikit-learn models to fit.
The idea is for this routine to let us easily test different models with different parameters (scikit-learn's GridSearchCV tunes a single estimator per search). Based on these results, we then choose which models and parameters to present in the section *Classification using different models*.

The **train_models** function receives:
- Two dictionaries: the models and the parameters.
- The training and test sets
- The number of folds

The function trains every model with the given parameters using the specified CV. Its results are converted to a data frame, which can then be sorted by different criteria (recall, F1-score, etc.).
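The core of such a search is expanding each parameter dictionary into every combination of values. A minimal sketch of that step, using only `itertools.product`, is shown below; the helper name `expand_grid` is an assumption for this example, and the grid mirrors the SGDClassifier entry used later in the notebook.

```python
import itertools

def expand_grid(param_grid):
    """Return one {name: value} dict per combination in the grid."""
    # Sort the names so the order of combinations is deterministic.
    names = sorted(param_grid)
    combos = itertools.product(*(param_grid[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]

sgd_grid = {"loss": ["hinge", "log"], "alpha": [0.0001, 0.001, 0.01, 0.1]}
candidates = expand_grid(sgd_grid)

# 2 losses x 4 alphas = 8 candidate configurations.
for candidate in candidates:
    print(candidate)
```

Each resulting dictionary can then be applied to a fresh copy of the estimator before fitting it on every fold.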

In [18]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn import svm
from sklearn import linear_model
from sklearn.linear_model import SGDClassifier

from sklearn.model_selection import KFold

from sklearn import metrics
from sklearn.metrics import roc_auc_score

import itertools as it

from collections import Counter

In [19]:

models1 = {
    'RandomForset': RandomForestClassifier(),
    'MultinomialNB': MultinomialNB(),
    'SVM_01': svm.SVC(),
    'SVM_02': svm.SVC(),
    'LogisticRegressionClassifier': linear_model.LogisticRegression() ,
    'LogisticRegressionClassifier_01': linear_model.LogisticRegression() , 
    'SGDClassifier' : SGDClassifier()
    
}

params1 = {
    'RandomForset': {"n_estimators" : [100] , "criterion" : ["gini", "entropy"]},
    'LogisticRegressionClassifier': { "solver":["liblinear" , "sag", "saga","lbfgs"], "multi_class":["ovr"], "penalty":["l2" ] , "C": [1.0,0.7]  } ,
    'LogisticRegressionClassifier_01': { "solver":["liblinear" ], "multi_class":["ovr"], "penalty":["l2","l1"] , "C": [1.0,0.7,0.2]  } ,
    'SVM_01':{"kernel" :['poly'] , "degree" : [2,3,4,5] } ,
    'SVM_02':{"kernel" :['linear', 'rbf', 'sigmoid']  } ,
    'MultinomialNB':{"alpha" :[1.0] },
    'SGDClassifier': {"loss": ["hinge","log"], "alpha": [0.0001, 0.001, 0.01, 0.1] }
} 

In [20]:
from copy import copy, deepcopy



def roc_auc_score_macro(actual_class, pred_class, average = "macro"):

    roc_auc = roc_auc_score(actual_class, pred_class, average = average , multi_class ='ovr')
 
    return roc_auc



def roc_auc_score_multiclass(actual_class, pred_class, average = "macro"):

  #creating a set of all the unique classes using the actual class list
  unique_class = set(actual_class)
  roc_auc_dict = {}
  for per_class in unique_class:
    #creating a list of all the classes except the current class 
    other_class = [x for x in unique_class if x != per_class]

    #marking the current class as 1 and all other classes as 0
    new_actual_class = [0 if x in other_class else 1 for x in actual_class]
    new_pred_class = [0 if x in other_class else 1 for x in pred_class]

    #using the sklearn metrics method to calculate the roc_auc_score
    roc_auc = roc_auc_score(new_actual_class, new_pred_class, average = average , multi_class ='ovr')
    roc_auc_dict[per_class] = roc_auc

  return roc_auc_dict

#def train_model(model, folds_index, X_train, Y_train):

def generate_model_params(model_params):
    
    allNames = sorted(model_params)
    combinations = it.product(*(model_params[Name] for Name in allNames))
    return (list(combinations) , allNames)


def train_model(model_id, model, params_names, param_combination, folds_index, X_train, Y_train , X_test, Y_test , output_dict = True , random_state = None):
    
    param_combination = list(param_combination)
    print ("train model")
    print (f"{model} {params_names} {param_combination}")
   
    
    cloned_model = deepcopy(model)
    
    
    for param_name , param_value in zip(params_names,param_combination ):
        #print (f"{param_name} =  {param_value}")
        setattr(cloned_model , param_name , param_value)

    if type(random_state) == int:
        setattr(cloned_model , "random_state" , random_state)
        
    print (cloned_model)
    
    results = []
    
    for train_index, test_index in folds_index:
   
        cloned_model_tmp = deepcopy(cloned_model)
        #print (f"{train_index}")
        #print (f"{test_index}")
    
    
    #X_train, X_test, y_train, y_test
    
        # Split according to the CV folds: take the training and validation rows of this fold together with their y values
        X_train_tmp, X_test_tmp = X_train[train_index], X_train[test_index]
        
        y_train_tmp, y_test_tmp = Y_train[train_index], Y_train[test_index] 
    
        cloned_model_tmp.fit(X_train_tmp,y_train_tmp)
       
    
        y_test_val_pred = cloned_model_tmp.predict(X_test_tmp)
        
        train_result = metrics.classification_report(y_test_tmp, y_test_val_pred , output_dict = output_dict )
        
        print(train_result)
        
        
        #roc_result = roc_auc_score(y_true = y_test_tmp, y_score = y_test_val_pred , multi_class = "ovr")
        
        roc_result = roc_auc_score_multiclass(actual_class=y_test_tmp, pred_class=y_test_val_pred)
        #roc_result_macro = roc_auc_score_macro(actual_class=y_test_tmp, pred_class=y_test_val_pred)
        
        
        results.append ((f"{model}","Train" , f"{params_names }", train_result , roc_result , f"{param_combination}" , f"{model}_{model_id}" ))
    
    
    cloned_model_tmp = deepcopy(cloned_model)
    
    
    cloned_model_tmp.fit(X_train,Y_train)
        
    y_test_pred = cloned_model_tmp.predict(X_test)
    
    test_result = metrics.classification_report(Y_test, y_test_pred , output_dict = output_dict )
    
    #roc_result = roc_auc_score(y_true = Y_test, y_score = y_test_pred , multi_class = "ovr")
    roc_result = roc_auc_score_multiclass(actual_class=Y_test, pred_class=y_test_pred)
    #roc_result_macro = roc_auc_score_macro(actual_class=y_test_tmp, pred_class=y_test_val_pred)
        
    results.append ((f"{model}","Test", f"{params_names} ", test_result , roc_result , f"{param_combination}" , f"{model}_{model_id}" ))
    
    print("Test")
    print(test_result)
    
    return results
    

def sum_train_values(results):

    
   
    total = (0,0,0)
    
    for model_result in results:
            total = (total[0] + model_result[3]['macro avg']['precision'] , total[1] + model_result[3]['macro avg']['recall'],total [2] + model_result[3]['macro avg']['f1-score'])

    #total = total / len (results)
    cantidad_filas = len (results)
    
    total = (total[0] / cantidad_filas, total[1] / cantidad_filas, total[2] / cantidad_filas)
    print ("Ponderado")
    print (f"{total}")
    
    
    
    return total

def train_models(X_train,Y_train,X_test, Y_test, cv=5,shuffle=True, models=None ,params=None , output_dict = True , random_state = None):
    
    results = []
    
    kf = KFold(n_splits=cv, random_state=random_state, shuffle=shuffle )
   
    model_id = 0 
    
    folds_index = [(train_index, test_index) for train_index, test_index in kf.split(X_train)  ]

    for param_model in params.keys():
    
        params_combination, params_names = generate_model_params(params.get(param_model))
        #print (f"Modelo a ejecutar: {param_model}, parámetros a probar: {params_combination} , nombre de los parámetros: {params_names} ")
        
        for param_combination in params_combination:
            #print (f"{param_model}: {param_combination} ")
            
            model_result = train_model(model_id = model_id, model = models.get(param_model),params_names = params_names, param_combination = param_combination, folds_index = folds_index, X_train = X_train, Y_train = Y_train , X_test = X_test, Y_test = Y_test , output_dict = output_dict , random_state = random_state )     
            
         
            results.extend( model_result  )
    
    
        model_id = model_id + 1  

    return results        
       

This function builds a data frame with the training results. Note that, to compute the ROC_AUC, we add up the per-class ROC_AUC values weighted by each class's share of the test set. Note also that the ROC_AUC value obtained here differs from the one shown in the diagrams. This is because this function computes a weighted ROC_AUC, whereas the yellowbrick library computes the micro and macro ROC_AUC, which follow a different criterion (see https://www.scikit-yb.org/en/latest/api/classifier/rocauc.html). We find our criterion more suitable.
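The weighting can be illustrated with a small pure-Python sketch. It relies on one observation: when the ROC AUC is computed from hard predicted labels rather than scores, as `roc_auc_score_multiclass` does with the output of `predict`, the one-vs-rest curve for a class has a single operating point, so its area reduces to `(TPR + TNR) / 2`. The function names and the toy labels below are assumptions for this example.

```python
from collections import Counter

def one_vs_rest_auc(actual, predicted):
    """Per-class AUC from hard labels: (TPR + TNR) / 2 for each class in `actual`.

    Assumes `actual` contains at least two distinct classes.
    """
    scores = {}
    for cls in set(actual):
        pos = [p for a, p in zip(actual, predicted) if a == cls]
        neg = [p for a, p in zip(actual, predicted) if a != cls]
        tpr = sum(p == cls for p in pos) / len(pos)  # true positive rate
        tnr = sum(p != cls for p in neg) / len(neg)  # true negative rate
        scores[cls] = (tpr + tnr) / 2
    return scores

def weighted_auc(actual, predicted):
    """Sum the per-class values weighted by each class's share of `actual`."""
    per_class = one_vs_rest_auc(actual, predicted)
    counts = Counter(actual)
    total = sum(counts.values())
    return sum(per_class[c] * counts[c] / total for c in per_class)

# Hypothetical labels for three classes.
actual = ["A", "A", "B", "B", "C"]
predicted = ["A", "B", "B", "B", "C"]

per_class = one_vs_rest_auc(actual, predicted)
overall = weighted_auc(actual, predicted)
print(per_class, overall)
```

Working from hard labels is one reason the values differ from curves drawn from predicted scores or probabilities.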

In [21]:
def toDataFrame(results, y_test):
    
    counter = Counter(y_test)
    total = counter['Documentos/FAMILIA'] + counter['Documentos/LABORAL'] + counter['Documentos/MENORES'] + counter['Documentos/PENAL']
    familia = counter['Documentos/FAMILIA'] / total
    laboral = counter['Documentos/LABORAL'] / total
    menores = counter['Documentos/MENORES'] / total
    penal = counter['Documentos/PENAL'] / total
    
    print ("Ponderado fuero")
    print (f"familia: {familia}, laboral: {laboral}, menores: {menores}, penal: {penal} ")
    
    filtered_values =  []
    columns = ["id", "modelo", "modo" , "parametros" , "valores" , "accuracy", "precision" , "recall" , "f1-score" , "roc_penal", "roc_familia" ,"roc_laboral" , "roc_menores" ,]
    for result in results:
        #print (f"{result[0]} {result[1]} {result[2]} {result[3]['macro avg']} \n")
        filtered_values.append(( result[6], result[0], result[1] , result[2] , result[5] , result[3]['accuracy'], result[3]['macro avg']['precision'] , result[3]['macro avg']['recall'] ,  result[3]['macro avg']['f1-score'] , result[4]["Documentos/PENAL"] , result[4]["Documentos/FAMILIA"] , result[4]["Documentos/LABORAL"] , result[4]["Documentos/MENORES"]))

    df= pd.DataFrame(data = filtered_values , columns = columns)
    
    df["roc_ponderado"] = (df["roc_penal"] * penal + df["roc_familia"] * familia + df["roc_laboral"] * laboral + df["roc_menores"] * menores)
    return df


In [22]:
results  = train_models(X_train= X_train_vect, Y_train =y_train.values ,  X_test = X_test_vect, Y_test = y_test.values,  models = models1 , params = params1 , cv=5 , output_dict = True , random_state = seed )

train model
RandomForestClassifier() ['criterion', 'n_estimators'] ['gini', 100]
RandomForestClassifier(random_state=42)
{'Documentos/FAMILIA': {'precision': 0.9375, 'recall': 1.0, 'f1-score': 0.967741935483871, 'support': 15}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 7}, 'Documentos/MENORES': {'precision': 1.0, 'recall': 0.6666666666666666, 'f1-score': 0.8, 'support': 3}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 8}, 'accuracy': 0.9696969696969697, 'macro avg': {'precision': 0.984375, 'recall': 0.9166666666666666, 'f1-score': 0.9419354838709677, 'support': 33}, 'weighted avg': {'precision': 0.9715909090909091, 'recall': 0.9696969696969697, 'f1-score': 0.9671554252199412, 'support': 33}}
{'Documentos/FAMILIA': {'precision': 0.9565217391304348, 'recall': 1.0, 'f1-score': 0.9777777777777777, 'support': 22}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 2}, 'Documentos/

Test
{'Documentos/FAMILIA': {'precision': 0.9166666666666666, 'recall': 1.0, 'f1-score': 0.9565217391304348, 'support': 44}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 12}, 'Documentos/MENORES': {'precision': 1.0, 'recall': 0.6, 'f1-score': 0.7499999999999999, 'support': 10}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 15}, 'accuracy': 0.9506172839506173, 'macro avg': {'precision': 0.9791666666666666, 'recall': 0.9, 'f1-score': 0.9266304347826086, 'support': 81}, 'weighted avg': {'precision': 0.9547325102880658, 'recall': 0.9506172839506173, 'f1-score': 0.9455179817498658, 'support': 81}}
train model
LogisticRegression() ['C', 'multi_class', 'penalty', 'solver'] [1.0, 'ovr', 'l2', 'sag']
LogisticRegression(multi_class='ovr', random_state=42, solver='sag')
{'Documentos/FAMILIA': {'precision': 0.8333333333333334, 'recall': 1.0, 'f1-score': 0.9090909090909091, 'support': 15}, 'Documentos/LABORAL': {'precision':



{'Documentos/FAMILIA': {'precision': 0.8333333333333334, 'recall': 1.0, 'f1-score': 0.9090909090909091, 'support': 15}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 0.8571428571428571, 'f1-score': 0.923076923076923, 'support': 7}, 'Documentos/MENORES': {'precision': 1.0, 'recall': 0.3333333333333333, 'f1-score': 0.5, 'support': 3}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 8}, 'accuracy': 0.9090909090909091, 'macro avg': {'precision': 0.9583333333333334, 'recall': 0.7976190476190477, 'f1-score': 0.833041958041958, 'support': 33}, 'weighted avg': {'precision': 0.9242424242424242, 'recall': 0.9090909090909091, 'f1-score': 0.8969061241788514, 'support': 33}}




{'Documentos/FAMILIA': {'precision': 0.9166666666666666, 'recall': 1.0, 'f1-score': 0.9565217391304348, 'support': 22}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 2}, 'Documentos/MENORES': {'precision': 1.0, 'recall': 0.3333333333333333, 'f1-score': 0.5, 'support': 3}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 6}, 'accuracy': 0.9393939393939394, 'macro avg': {'precision': 0.9791666666666666, 'recall': 0.8333333333333334, 'f1-score': 0.8641304347826086, 'support': 33}, 'weighted avg': {'precision': 0.9444444444444444, 'recall': 0.9393939393939394, 'f1-score': 0.9255599472990778, 'support': 33}}




{'Documentos/FAMILIA': {'precision': 0.7894736842105263, 'recall': 1.0, 'f1-score': 0.8823529411764706, 'support': 15}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 0.875, 'f1-score': 0.9333333333333333, 'support': 8}, 'Documentos/MENORES': {'precision': 1.0, 'recall': 0.25, 'f1-score': 0.4, 'support': 4}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 5}, 'accuracy': 0.875, 'macro avg': {'precision': 0.9473684210526316, 'recall': 0.78125, 'f1-score': 0.803921568627451, 'support': 32}, 'weighted avg': {'precision': 0.9013157894736843, 'recall': 0.875, 'f1-score': 0.853186274509804, 'support': 32}}




{'Documentos/FAMILIA': {'precision': 0.9285714285714286, 'recall': 1.0, 'f1-score': 0.962962962962963, 'support': 13}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 6}, 'Documentos/MENORES': {'precision': 1.0, 'recall': 0.6666666666666666, 'f1-score': 0.8, 'support': 3}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 10}, 'accuracy': 0.96875, 'macro avg': {'precision': 0.9821428571428572, 'recall': 0.9166666666666666, 'f1-score': 0.9407407407407408, 'support': 32}, 'weighted avg': {'precision': 0.9709821428571428, 'recall': 0.96875, 'f1-score': 0.9662037037037037, 'support': 32}}




{'Documentos/FAMILIA': {'precision': 0.8333333333333334, 'recall': 1.0, 'f1-score': 0.9090909090909091, 'support': 15}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 2}, 'Documentos/MENORES': {'precision': 1.0, 'recall': 0.5, 'f1-score': 0.6666666666666666, 'support': 6}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 9}, 'accuracy': 0.90625, 'macro avg': {'precision': 0.9583333333333334, 'recall': 0.875, 'f1-score': 0.8939393939393939, 'support': 32}, 'weighted avg': {'precision': 0.921875, 'recall': 0.90625, 'f1-score': 0.8948863636363636, 'support': 32}}




Test
{'Documentos/FAMILIA': {'precision': 0.9166666666666666, 'recall': 1.0, 'f1-score': 0.9565217391304348, 'support': 44}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 12}, 'Documentos/MENORES': {'precision': 1.0, 'recall': 0.6, 'f1-score': 0.7499999999999999, 'support': 10}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 15}, 'accuracy': 0.9506172839506173, 'macro avg': {'precision': 0.9791666666666666, 'recall': 0.9, 'f1-score': 0.9266304347826086, 'support': 81}, 'weighted avg': {'precision': 0.9547325102880658, 'recall': 0.9506172839506173, 'f1-score': 0.9455179817498658, 'support': 81}}
train model
LogisticRegression() ['C', 'multi_class', 'penalty', 'solver'] [1.0, 'ovr', 'l2', 'lbfgs']
LogisticRegression(multi_class='ovr', random_state=42)
{'Documentos/FAMILIA': {'precision': 0.8333333333333334, 'recall': 1.0, 'f1-score': 0.9090909090909091, 'support': 15}, 'Documentos/LABORAL': {'precision': 1.0, 'recal



Test
{'Documentos/FAMILIA': {'precision': 0.8627450980392157, 'recall': 1.0, 'f1-score': 0.9263157894736842, 'support': 44}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 12}, 'Documentos/MENORES': {'precision': 1.0, 'recall': 0.3, 'f1-score': 0.4615384615384615, 'support': 10}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 15}, 'accuracy': 0.9135802469135802, 'macro avg': {'precision': 0.9656862745098039, 'recall': 0.825, 'f1-score': 0.8469635627530364, 'support': 81}, 'weighted avg': {'precision': 0.925441781650932, 'recall': 0.9135802469135802, 'f1-score': 0.8934972759534162, 'support': 81}}
train model
LogisticRegression() ['C', 'multi_class', 'penalty', 'solver'] [0.7, 'ovr', 'l2', 'sag']
LogisticRegression(C=0.7, multi_class='ovr', random_state=42, solver='sag')
{'Documentos/FAMILIA': {'precision': 0.8333333333333334, 'recall': 1.0, 'f1-score': 0.9090909090909091, 'support': 15}, 'Documentos/LABORAL': {'pre



{'Documentos/FAMILIA': {'precision': 0.88, 'recall': 1.0, 'f1-score': 0.9361702127659575, 'support': 22}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 2}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 3}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 6}, 'accuracy': 0.9090909090909091, 'macro avg': {'precision': 0.72, 'recall': 0.75, 'f1-score': 0.7340425531914894, 'support': 33}, 'weighted avg': {'precision': 0.8290909090909091, 'recall': 0.9090909090909091, 'f1-score': 0.8665377176015473, 'support': 33}}




{'Documentos/FAMILIA': {'precision': 0.75, 'recall': 1.0, 'f1-score': 0.8571428571428571, 'support': 15}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 0.875, 'f1-score': 0.9333333333333333, 'support': 8}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 4}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 5}, 'accuracy': 0.84375, 'macro avg': {'precision': 0.6875, 'recall': 0.71875, 'f1-score': 0.6976190476190476, 'support': 32}, 'weighted avg': {'precision': 0.7578125, 'recall': 0.84375, 'f1-score': 0.7913690476190476, 'support': 32}}




{'Documentos/FAMILIA': {'precision': 0.8125, 'recall': 1.0, 'f1-score': 0.896551724137931, 'support': 13}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 6}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 3}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 10}, 'accuracy': 0.90625, 'macro avg': {'precision': 0.703125, 'recall': 0.75, 'f1-score': 0.7241379310344828, 'support': 32}, 'weighted avg': {'precision': 0.830078125, 'recall': 0.90625, 'f1-score': 0.8642241379310345, 'support': 32}}




{'Documentos/FAMILIA': {'precision': 0.6818181818181818, 'recall': 1.0, 'f1-score': 0.8108108108108109, 'support': 15}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 2}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 6}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 0.8888888888888888, 'f1-score': 0.9411764705882353, 'support': 9}, 'accuracy': 0.78125, 'macro avg': {'precision': 0.6704545454545454, 'recall': 0.7222222222222222, 'f1-score': 0.6879968203497615, 'support': 32}, 'weighted avg': {'precision': 0.6633522727272727, 'recall': 0.78125, 'f1-score': 0.7072734499205088, 'support': 32}}
Test
{'Documentos/FAMILIA': {'precision': 0.8627450980392157, 'recall': 1.0, 'f1-score': 0.9263157894736842, 'support': 44}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 12}, 'Documentos/MENORES': {'precision': 1.0, 'recall': 0.3, 'f1-score': 0.4615384615384615, 'support': 10}, 'Docume



{'Documentos/FAMILIA': {'precision': 0.8333333333333334, 'recall': 1.0, 'f1-score': 0.9090909090909091, 'support': 15}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 0.8571428571428571, 'f1-score': 0.923076923076923, 'support': 7}, 'Documentos/MENORES': {'precision': 1.0, 'recall': 0.3333333333333333, 'f1-score': 0.5, 'support': 3}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 8}, 'accuracy': 0.9090909090909091, 'macro avg': {'precision': 0.9583333333333334, 'recall': 0.7976190476190477, 'f1-score': 0.833041958041958, 'support': 33}, 'weighted avg': {'precision': 0.9242424242424242, 'recall': 0.9090909090909091, 'f1-score': 0.8969061241788514, 'support': 33}}




{'Documentos/FAMILIA': {'precision': 0.88, 'recall': 1.0, 'f1-score': 0.9361702127659575, 'support': 22}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 2}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 3}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 6}, 'accuracy': 0.9090909090909091, 'macro avg': {'precision': 0.72, 'recall': 0.75, 'f1-score': 0.7340425531914894, 'support': 33}, 'weighted avg': {'precision': 0.8290909090909091, 'recall': 0.9090909090909091, 'f1-score': 0.8665377176015473, 'support': 33}}




{'Documentos/FAMILIA': {'precision': 0.7894736842105263, 'recall': 1.0, 'f1-score': 0.8823529411764706, 'support': 15}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 0.875, 'f1-score': 0.9333333333333333, 'support': 8}, 'Documentos/MENORES': {'precision': 1.0, 'recall': 0.25, 'f1-score': 0.4, 'support': 4}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 5}, 'accuracy': 0.875, 'macro avg': {'precision': 0.9473684210526316, 'recall': 0.78125, 'f1-score': 0.803921568627451, 'support': 32}, 'weighted avg': {'precision': 0.9013157894736843, 'recall': 0.875, 'f1-score': 0.853186274509804, 'support': 32}}




{'Documentos/FAMILIA': {'precision': 0.8125, 'recall': 1.0, 'f1-score': 0.896551724137931, 'support': 13}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 6}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 3}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 10}, 'accuracy': 0.90625, 'macro avg': {'precision': 0.703125, 'recall': 0.75, 'f1-score': 0.7241379310344828, 'support': 32}, 'weighted avg': {'precision': 0.830078125, 'recall': 0.90625, 'f1-score': 0.8642241379310345, 'support': 32}}




{'Documentos/FAMILIA': {'precision': 0.6818181818181818, 'recall': 1.0, 'f1-score': 0.8108108108108109, 'support': 15}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 2}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 6}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 0.8888888888888888, 'f1-score': 0.9411764705882353, 'support': 9}, 'accuracy': 0.78125, 'macro avg': {'precision': 0.6704545454545454, 'recall': 0.7222222222222222, 'f1-score': 0.6879968203497615, 'support': 32}, 'weighted avg': {'precision': 0.6633522727272727, 'recall': 0.78125, 'f1-score': 0.7072734499205088, 'support': 32}}




Test
{'Documentos/FAMILIA': {'precision': 0.8627450980392157, 'recall': 1.0, 'f1-score': 0.9263157894736842, 'support': 44}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 12}, 'Documentos/MENORES': {'precision': 1.0, 'recall': 0.3, 'f1-score': 0.4615384615384615, 'support': 10}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 15}, 'accuracy': 0.9135802469135802, 'macro avg': {'precision': 0.9656862745098039, 'recall': 0.825, 'f1-score': 0.8469635627530364, 'support': 81}, 'weighted avg': {'precision': 0.925441781650932, 'recall': 0.9135802469135802, 'f1-score': 0.8934972759534162, 'support': 81}}
train model
LogisticRegression() ['C', 'multi_class', 'penalty', 'solver'] [0.7, 'ovr', 'l2', 'lbfgs']
LogisticRegression(C=0.7, multi_class='ovr', random_state=42)
{'Documentos/FAMILIA': {'precision': 0.8333333333333334, 'recall': 1.0, 'f1-score': 0.9090909090909091, 'support': 15}, 'Documentos/LABORAL': {'precision': 1.0



{'Documentos/FAMILIA': {'precision': 0.88, 'recall': 1.0, 'f1-score': 0.9361702127659575, 'support': 22}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 2}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 3}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 6}, 'accuracy': 0.9090909090909091, 'macro avg': {'precision': 0.72, 'recall': 0.75, 'f1-score': 0.7340425531914894, 'support': 33}, 'weighted avg': {'precision': 0.8290909090909091, 'recall': 0.9090909090909091, 'f1-score': 0.8665377176015473, 'support': 33}}




{'Documentos/FAMILIA': {'precision': 0.75, 'recall': 1.0, 'f1-score': 0.8571428571428571, 'support': 15}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 0.875, 'f1-score': 0.9333333333333333, 'support': 8}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 4}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 5}, 'accuracy': 0.84375, 'macro avg': {'precision': 0.6875, 'recall': 0.71875, 'f1-score': 0.6976190476190476, 'support': 32}, 'weighted avg': {'precision': 0.7578125, 'recall': 0.84375, 'f1-score': 0.7913690476190476, 'support': 32}}




{'Documentos/FAMILIA': {'precision': 0.8125, 'recall': 1.0, 'f1-score': 0.896551724137931, 'support': 13}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 6}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 3}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 10}, 'accuracy': 0.90625, 'macro avg': {'precision': 0.703125, 'recall': 0.75, 'f1-score': 0.7241379310344828, 'support': 32}, 'weighted avg': {'precision': 0.830078125, 'recall': 0.90625, 'f1-score': 0.8642241379310345, 'support': 32}}




{'Documentos/FAMILIA': {'precision': 0.6818181818181818, 'recall': 1.0, 'f1-score': 0.8108108108108109, 'support': 15}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 2}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 6}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 0.8888888888888888, 'f1-score': 0.9411764705882353, 'support': 9}, 'accuracy': 0.78125, 'macro avg': {'precision': 0.6704545454545454, 'recall': 0.7222222222222222, 'f1-score': 0.6879968203497615, 'support': 32}, 'weighted avg': {'precision': 0.6633522727272727, 'recall': 0.78125, 'f1-score': 0.7072734499205088, 'support': 32}}
Test
{'Documentos/FAMILIA': {'precision': 0.8627450980392157, 'recall': 1.0, 'f1-score': 0.9263157894736842, 'support': 44}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 12}, 'Documentos/MENORES': {'precision': 1.0, 'recall': 0.3, 'f1-score': 0.4615384615384615, 'support': 10}, 'Docume



{'Documentos/FAMILIA': {'precision': 0.8666666666666667, 'recall': 1.0, 'f1-score': 0.9285714285714286, 'support': 13}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 6}, 'Documentos/MENORES': {'precision': 1.0, 'recall': 0.3333333333333333, 'f1-score': 0.5, 'support': 3}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 10}, 'accuracy': 0.9375, 'macro avg': {'precision': 0.9666666666666667, 'recall': 0.8333333333333334, 'f1-score': 0.8571428571428572, 'support': 32}, 'weighted avg': {'precision': 0.9458333333333333, 'recall': 0.9375, 'f1-score': 0.9241071428571428, 'support': 32}}
{'Documentos/FAMILIA': {'precision': 0.7142857142857143, 'recall': 1.0, 'f1-score': 0.8333333333333333, 'support': 15}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 2}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 6}, 'Documentos/PENAL': {'precision': 1.0, 'recal



{'Documentos/FAMILIA': {'precision': 0.88, 'recall': 1.0, 'f1-score': 0.9361702127659575, 'support': 22}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 2}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 3}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 6}, 'accuracy': 0.9090909090909091, 'macro avg': {'precision': 0.72, 'recall': 0.75, 'f1-score': 0.7340425531914894, 'support': 33}, 'weighted avg': {'precision': 0.8290909090909091, 'recall': 0.9090909090909091, 'f1-score': 0.8665377176015473, 'support': 33}}
{'Documentos/FAMILIA': {'precision': 0.7894736842105263, 'recall': 1.0, 'f1-score': 0.8823529411764706, 'support': 15}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 0.875, 'f1-score': 0.9333333333333333, 'support': 8}, 'Documentos/MENORES': {'precision': 1.0, 'recall': 0.25, 'f1-score': 0.4, 'support': 4}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-sco



{'Documentos/FAMILIA': {'precision': 0.75, 'recall': 1.0, 'f1-score': 0.8571428571428571, 'support': 15}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 0.8571428571428571, 'f1-score': 0.923076923076923, 'support': 7}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 3}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 0.875, 'f1-score': 0.9333333333333333, 'support': 8}, 'accuracy': 0.8484848484848485, 'macro avg': {'precision': 0.6875, 'recall': 0.6830357142857143, 'f1-score': 0.6783882783882784, 'support': 33}, 'weighted avg': {'precision': 0.7954545454545454, 'recall': 0.8484848484848485, 'f1-score': 0.8116772116772116, 'support': 33}}
{'Documentos/FAMILIA': {'precision': 0.8461538461538461, 'recall': 1.0, 'f1-score': 0.9166666666666666, 'support': 22}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 2}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 3}, 'Documentos/PENAL



{'Documentos/FAMILIA': {'precision': 0.8125, 'recall': 1.0, 'f1-score': 0.896551724137931, 'support': 13}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 6}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 3}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 10}, 'accuracy': 0.90625, 'macro avg': {'precision': 0.703125, 'recall': 0.75, 'f1-score': 0.7241379310344828, 'support': 32}, 'weighted avg': {'precision': 0.830078125, 'recall': 0.90625, 'f1-score': 0.8642241379310345, 'support': 32}}
{'Documentos/FAMILIA': {'precision': 0.6818181818181818, 'recall': 1.0, 'f1-score': 0.8108108108108109, 'support': 15}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 2}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 6}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 0.8888888888888888, 'f1-score': 0.9411764705882353, '


{'Documentos/FAMILIA': {'precision': 0.5555555555555556, 'recall': 1.0, 'f1-score': 0.7142857142857143, 'support': 15}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 2}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 6}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 0.3333333333333333, 'f1-score': 0.5, 'support': 9}, 'accuracy': 0.625, 'macro avg': {'precision': 0.6388888888888888, 'recall': 0.5833333333333334, 'f1-score': 0.5535714285714286, 'support': 32}, 'weighted avg': {'precision': 0.6041666666666667, 'recall': 0.625, 'f1-score': 0.5379464285714286, 'support': 32}}
Test
{'Documentos/FAMILIA': {'precision': 0.7213114754098361, 'recall': 1.0, 'f1-score': 0.8380952380952381, 'support': 44}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 0.6666666666666666, 'f1-score': 0.8, 'support': 12}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 10}, 'Documentos/PENAL': {'prec



Test
{'Documentos/FAMILIA': {'precision': 0.5432098765432098, 'recall': 1.0, 'f1-score': 0.704, 'support': 44}, 'Documentos/LABORAL': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 12}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 10}, 'Documentos/PENAL': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 15}, 'accuracy': 0.5432098765432098, 'macro avg': {'precision': 0.13580246913580246, 'recall': 0.25, 'f1-score': 0.176, 'support': 81}, 'weighted avg': {'precision': 0.29507696997408933, 'recall': 0.5432098765432098, 'f1-score': 0.38241975308641973, 'support': 81}}
train model
SVC() ['degree', 'kernel'] [2, 'poly']
SVC(degree=2, kernel='poly', random_state=42)
{'Documentos/FAMILIA': {'precision': 0.8823529411764706, 'recall': 1.0, 'f1-score': 0.9375, 'support': 15}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 0.8571428571428571, 'f1-score': 0.923076923076923, 'support': 7}, 'Documentos/MENORES': {'precision': 1.0,



{'Documentos/FAMILIA': {'precision': 0.8148148148148148, 'recall': 1.0, 'f1-score': 0.8979591836734693, 'support': 22}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 2}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 3}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 0.6666666666666666, 'f1-score': 0.8, 'support': 6}, 'accuracy': 0.8484848484848485, 'macro avg': {'precision': 0.7037037037037037, 'recall': 0.6666666666666666, 'f1-score': 0.6744897959183673, 'support': 33}, 'weighted avg': {'precision': 0.7856341189674523, 'recall': 0.8484848484848485, 'f1-score': 0.804700061842919, 'support': 33}}
{'Documentos/FAMILIA': {'precision': 0.7894736842105263, 'recall': 1.0, 'f1-score': 0.8823529411764706, 'support': 15}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 0.875, 'f1-score': 0.9333333333333333, 'support': 8}, 'Documentos/MENORES': {'precision': 1.0, 'recall': 0.5, 'f1-score': 0.6666666666666666, 'suppo



{'Documentos/FAMILIA': {'precision': 0.7096774193548387, 'recall': 1.0, 'f1-score': 0.8301886792452831, 'support': 22}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 2}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 3}, 'Documentos/PENAL': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 6}, 'accuracy': 0.7272727272727273, 'macro avg': {'precision': 0.4274193548387097, 'recall': 0.5, 'f1-score': 0.45754716981132076, 'support': 33}, 'weighted avg': {'precision': 0.533724340175953, 'recall': 0.7272727272727273, 'f1-score': 0.614065180102916, 'support': 33}}




{'Documentos/FAMILIA': {'precision': 0.5, 'recall': 1.0, 'f1-score': 0.6666666666666666, 'support': 15}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 0.125, 'f1-score': 0.2222222222222222, 'support': 8}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 4}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 0.2, 'f1-score': 0.33333333333333337, 'support': 5}, 'accuracy': 0.53125, 'macro avg': {'precision': 0.625, 'recall': 0.33125, 'f1-score': 0.3055555555555556, 'support': 32}, 'weighted avg': {'precision': 0.640625, 'recall': 0.53125, 'f1-score': 0.42013888888888895, 'support': 32}}
{'Documentos/FAMILIA': {'precision': 0.48148148148148145, 'recall': 1.0, 'f1-score': 0.65, 'support': 13}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 0.3333333333333333, 'f1-score': 0.5, 'support': 6}, 'Documentos/MENORES': {'precision': 1.0, 'recall': 0.3333333333333333, 'f1-score': 0.5, 'support': 3}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 0.2, 'f1



Test
{'Documentos/FAMILIA': {'precision': 0.6567164179104478, 'recall': 1.0, 'f1-score': 0.7927927927927928, 'support': 44}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 0.5833333333333334, 'f1-score': 0.7368421052631579, 'support': 12}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 10}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 0.4666666666666667, 'f1-score': 0.6363636363636364, 'support': 15}, 'accuracy': 0.7160493827160493, 'macro avg': {'precision': 0.664179104477612, 'recall': 0.5125000000000001, 'f1-score': 0.5414996336048967, 'support': 81}, 'weighted avg': {'precision': 0.6900681776303667, 'recall': 0.7160493827160493, 'f1-score': 0.6576597863147571, 'support': 81}}
train model
SVC() ['degree', 'kernel'] [5, 'poly']
SVC(degree=5, kernel='poly', random_state=42)
{'Documentos/FAMILIA': {'precision': 0.45454545454545453, 'recall': 1.0, 'f1-score': 0.625, 'support': 15}, 'Documentos/LABORAL': {'precision': 0.0, 'recall': 0.0, 'f1-s



{'Documentos/FAMILIA': {'precision': 0.6875, 'recall': 1.0, 'f1-score': 0.8148148148148148, 'support': 22}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 0.5, 'f1-score': 0.6666666666666666, 'support': 2}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 3}, 'Documentos/PENAL': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 6}, 'accuracy': 0.696969696969697, 'macro avg': {'precision': 0.421875, 'recall': 0.375, 'f1-score': 0.37037037037037035, 'support': 33}, 'weighted avg': {'precision': 0.5189393939393939, 'recall': 0.696969696969697, 'f1-score': 0.5836139169472502, 'support': 33}}
{'Documentos/FAMILIA': {'precision': 0.4838709677419355, 'recall': 1.0, 'f1-score': 0.6521739130434783, 'support': 15}, 'Documentos/LABORAL': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 8}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 4}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 0.2, 'f1-



{'Documentos/FAMILIA': {'precision': 0.41935483870967744, 'recall': 1.0, 'f1-score': 0.5909090909090909, 'support': 13}, 'Documentos/LABORAL': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 6}, 'Documentos/MENORES': {'precision': 1.0, 'recall': 0.3333333333333333, 'f1-score': 0.5, 'support': 3}, 'Documentos/PENAL': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 10}, 'accuracy': 0.4375, 'macro avg': {'precision': 0.3548387096774194, 'recall': 0.3333333333333333, 'f1-score': 0.2727272727272727, 'support': 32}, 'weighted avg': {'precision': 0.2641129032258065, 'recall': 0.4375, 'f1-score': 0.28693181818181823, 'support': 32}}
{'Documentos/FAMILIA': {'precision': 0.4838709677419355, 'recall': 1.0, 'f1-score': 0.6521739130434783, 'support': 15}, 'Documentos/LABORAL': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 2}, 'Documentos/MENORES': {'precision': 1.0, 'recall': 0.16666666666666666, 'f1-score': 0.2857142857142857, 'support': 6}, 'Documentos/



Test
{'Documentos/FAMILIA': {'precision': 0.5866666666666667, 'recall': 1.0, 'f1-score': 0.7394957983193278, 'support': 44}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 0.16666666666666666, 'f1-score': 0.2857142857142857, 'support': 12}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 10}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 0.26666666666666666, 'f1-score': 0.4210526315789474, 'support': 15}, 'accuracy': 0.6172839506172839, 'macro avg': {'precision': 0.6466666666666667, 'recall': 0.35833333333333334, 'f1-score': 0.3615656789031402, 'support': 81}, 'weighted avg': {'precision': 0.6520164609053498, 'recall': 0.6172839506172839, 'f1-score': 0.5220021731889637, 'support': 81}}
train model
SVC() ['kernel'] ['linear']
SVC(kernel='linear', random_state=42)
{'Documentos/FAMILIA': {'precision': 0.9375, 'recall': 1.0, 'f1-score': 0.967741935483871, 'support': 15}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'sup

{'Documentos/FAMILIA': {'precision': 0.9565217391304348, 'recall': 1.0, 'f1-score': 0.9777777777777777, 'support': 22}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 2}, 'Documentos/MENORES': {'precision': 1.0, 'recall': 0.6666666666666666, 'f1-score': 0.8, 'support': 3}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 6}, 'accuracy': 0.9696969696969697, 'macro avg': {'precision': 0.9891304347826086, 'recall': 0.9166666666666666, 'f1-score': 0.9444444444444444, 'support': 33}, 'weighted avg': {'precision': 0.9710144927536231, 'recall': 0.9696969696969697, 'f1-score': 0.967003367003367, 'support': 33}}
{'Documentos/FAMILIA': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 15}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 8}, 'Documentos/MENORES': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 4}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0,



{'Documentos/FAMILIA': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 22}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 2}, 'Documentos/MENORES': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 3}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 6}, 'accuracy': 1.0, 'macro avg': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 33}, 'weighted avg': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 33}}
{'Documentos/FAMILIA': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 15}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 8}, 'Documentos/MENORES': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 4}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 5}, 'accuracy': 1.0, 'macro avg': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 32}, 'weighted avg': {'p

{'Documentos/FAMILIA': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 13}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 6}, 'Documentos/MENORES': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 3}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 10}, 'accuracy': 1.0, 'macro avg': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 32}, 'weighted avg': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 32}}
{'Documentos/FAMILIA': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 15}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 2}, 'Documentos/MENORES': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 6}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 9}, 'accuracy': 1.0, 'macro avg': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 32}, 'weighted avg': {'


{'Documentos/FAMILIA': {'precision': 0.6521739130434783, 'recall': 1.0, 'f1-score': 0.7894736842105263, 'support': 15}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 0.7142857142857143, 'f1-score': 0.8333333333333333, 'support': 7}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 3}, 'Documentos/PENAL': {'precision': 1.0, 'recall': 0.625, 'f1-score': 0.7692307692307693, 'support': 8}, 'accuracy': 0.7575757575757576, 'macro avg': {'precision': 0.6630434782608696, 'recall': 0.5848214285714286, 'f1-score': 0.5980094466936572, 'support': 33}, 'weighted avg': {'precision': 0.7509881422924901, 'recall': 0.7575757575757576, 'f1-score': 0.7220995378890116, 'support': 33}}
{'Documentos/FAMILIA': {'precision': 0.88, 'recall': 1.0, 'f1-score': 0.9361702127659575, 'support': 22}, 'Documentos/LABORAL': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 2}, 'Documentos/MENORES': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 3}, 'Doc



In [23]:
result_logistic = toDataFrame(results , y_test.values)

Ponderado fuero
familia: 0.5432098765432098, laboral: 0.14814814814814814, menores: 0.12345679012345678, penal: 0.18518518518518517 


### Ranking the models (and the parameters used) by different metrics

#### Test: f1-score and accuracy

Because the dataset is imbalanced, we rank the models by f1-score. Accuracy is not a reliable metric on an imbalanced dataset, so we use it only as a tie-breaker.
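To make the point about accuracy concrete, here is a minimal sketch in plain Python. It assumes a degenerate classifier that always predicts the majority class, evaluated on the test-set class sizes reported in the outputs above (44, 12, 10 and 15 documents). The helper name `accuracy_and_macro_f1` is hypothetical and is not part of the notebook.

```python
def accuracy_and_macro_f1(y_true, y_pred):
    """Return (accuracy, macro-averaged f1) for two equally long label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # f1 is defined as 0 when the class is never predicted and never hit.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return correct / len(y_true), sum(f1_scores) / len(f1_scores)


# Class sizes of the test split, taken from the 'support' values above.
support = {"FAMILIA": 44, "LABORAL": 12, "MENORES": 10, "PENAL": 15}
y_true = [label for label, n in support.items() for _ in range(n)]
# Assumption for illustration: a model that always answers the majority class.
y_pred = ["FAMILIA"] * len(y_true)

accuracy, macro_f1 = accuracy_and_macro_f1(y_true, y_pred)
print(round(accuracy, 3), round(macro_f1, 3))  # 0.543 0.176
```

The model gets more than half of the documents right while being useless for three of the four classes, which only the macro f1-score reveals. These are the same figures as the test report above that shows accuracy 0.543 and macro f1-score 0.176.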

In [24]:
result_logistic[result_logistic["modo"] == "Test"].sort_values(by=["f1-score", "accuracy"], ascending=False)

|     | id | modelo | modo | parametros | valores | accuracy | precision | recall | f1-score | roc_penal | roc_familia | roc_laboral | roc_menores | roc_ponderado |
| :-- | :-- | :-- | :-- | :-- | :-- | --: | --: | --: | --: | --: | --: | --: | --: | --: |
| 11  | RandomForestClassifier()_0 | RandomForestClassifier() | Test | ['criterion', 'n_estimators'] | ['entropy', 100] | 0.987654 | 0.994444 | 0.975 | 0.984033 | 1.0 | 0.986486 | 1.0 | 0.95 | 0.986486 |
| 5   | RandomForestClassifier()_0 | RandomForestClassifier() | Test | ['criterion', 'n_estimators'] | ['gini', 100] | 0.975309 | 0.969318 | 0.969318 | 0.969318 | 1.0 | 0.975123 | 1.0 | 0.942958 | 0.979444 |
| 173 | SGDClassifier()_6 | SGDClassifier() | Test | ['alpha', 'loss'] | [0.01, 'hinge'] | 0.975309 | 0.969318 | 0.969318 | 0.969318 | 1.0 | 0.975123 | 1.0 | 0.942958 | 0.979444 |
| 101 | SVC()_3 | SVC() | Test | ['degree', 'kernel'] | [2, 'poly'] | 0.975309 | 0.98913 | 0.95 | 0.966667 | 1.0 | 0.972973 | 1.0 | 0.9 | 0.972973 |
| 131 | SVC()_4 | SVC() | Test | ['kernel'] | ['rbf'] | 0.975309 | 0.98913 | 0.95 | 0.966667 | 1.0 | 0.972973 | 1.0 | 0.9 | 0.972973 |
| 167 | SGDClassifier()_6 | SGDClassifier() | Test | ['alpha', 'loss'] | [0.001, 'log'] | 0.975309 | 0.98913 | 0.95 | 0.966667 | 1.0 | 0.972973 | 1.0 | 0.9 | 0.972973 |
| 125 | SVC()_4 | SVC() | Test | ['kernel'] | ['linear'] | 0.962963 | 0.948732 | 0.963636 | 0.955665 | 1.0 | 0.963759 | 1.0 | 0.935915 | 0.972402 |
| 149 | SGDClassifier()_6 | SGDClassifier() | Test | ['alpha', 'loss'] | [0.0001, 'hinge'] | 0.962963 | 0.948732 | 0.963636 | 0.955665 | 1.0 | 0.963759 | 1.0 | 0.935915 | 0.972402 |
| 155 | SGDClassifier()_6 | SGDClassifier() | Test | ['alpha', 'loss'] | [0.0001, 'log'] | 0.962963 | 0.948732 | 0.963636 | 0.955665 | 1.0 | 0.963759 | 1.0 | 0.935915 | 0.972402 |
| 161 | SGDClassifier()_6 | SGDClassifier() | Test | ['alpha', 'loss'] | [0.001, 'hinge'] | 0.962963 | 0.948732 | 0.963636 | 0.955665 | 1.0 | 0.963759 | 1.0 | 0.935915 | 0.972402 |
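The `roc_ponderado` column appears to be the average of the four per-class ROC values, weighted by the class proportions printed by the previous cell (`Ponderado fuero`). This is an inference from the numbers, not something the notebook states. The sketch below checks it for the first row of the table; the class counts are the test `support` values from the reports above.

```python
# Test-set class counts (the 'support' values in the test reports).
support = {"familia": 44, "laboral": 12, "menores": 10, "penal": 15}
total = sum(support.values())
# These proportions match the weights printed under "Ponderado fuero".
weights = {label: n / total for label, n in support.items()}

# Per-class ROC values of the first row (RandomForestClassifier, 'entropy').
roc = {"penal": 1.0, "familia": 0.986486, "laboral": 1.0, "menores": 0.95}

# Support-weighted average of the per-class values.
roc_weighted = sum(weights[label] * roc[label] for label in roc)
print(round(roc_weighted, 6))
```

Under this assumption the result agrees with the 0.986486 reported in `roc_ponderado` for that row. Weighting by support means that FAMILIA, with more than half of the documents, dominates the combined value.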


#### Train: f1-score and accuracy

Because the dataset is imbalanced, we again rank by f1-score. **Note that the metrics are averaged per model**: each row is the mean of the results over all the parameter combinations tried for that model.

In [25]:
result_train = result_logistic[result_logistic["modo"] !="Test"]

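# numeric_only=True keeps only the metric columns; recent pandas versions raise on text columns otherwise
result_train.groupby(['id']).mean(numeric_only=True).sort_values(['f1-score', 'accuracy'], ascending=False)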

| id | accuracy | precision | recall | f1-score | roc_penal | roc_familia | roc_laboral | roc_menores | roc_ponderado |
| :-- | --: | --: | --: | --: | --: | --: | --: | --: | --: |
| RandomForestClassifier()_0 | 0.975379 | 0.988451 | 0.945833 | 0.962362 | 1.0 | 0.973589 | 1.0 | 0.891667 | 0.972279 |
| SVC()_4 | 0.975505 | 0.989298 | 0.945536 | 0.959901 | 1.0 | 0.972628 | 0.991071 | 0.9 | 0.971463 |
| SGDClassifier()_6 | 0.919437 | 0.904974 | 0.876503 | 0.875803 | 0.956875 | 0.923346 | 0.953423 | 0.842708 | 0.924056 |
| LogisticRegression()_1 | 0.891667 | 0.867654 | 0.789038 | 0.796985 | 0.994444 | 0.891411 | 0.973214 | 0.610417 | 0.88792 |
| LogisticRegression()_2 | 0.771117 | 0.649828 | 0.634049 | 0.614356 | 0.860401 | 0.776322 | 0.846189 | 0.561111 | 0.775673 |
| SVC()_3 | 0.718703 | 0.718092 | 0.595799 | 0.592381 | 0.728056 | 0.720721 | 0.813542 | 0.65 | 0.727099 |
| MultinomialNB()_5 | 0.66553 | 0.550492 | 0.521806 | 0.492194 | 0.826944 | 0.685644 | 0.716667 | 0.5 | 0.693488 |


## Results

When we treat the words in the lowest decile of IDF values as stop words (and remove them from the dataset), the models score lower (lower f1-score, accuracy and ROC AUC). This may suggest that we are breaking the overfitting; even so, validating this hypothesis requires more data. For example, for Random Forest we obtain the following results:


| % removed      | f1-score     |
| :------------- | -----------: |
|  0%            | 0.962362     |
|  2.5%          | 0.911784     |
|  5%            | 0.884469     |
|  7.5%          | 0.811515     |
|  10%           | 0.707842     | 

To reproduce these results, change on line X the index used on the results data frame, for example:


- `limit = percent_df.loc[.025].values[0]` for 2.5%
- `limit = percent_df.loc[.05].values[0]` for 5%
- `limit = percent_df.loc[.1].values[0]` for 10%

To work with 0%, comment out `stop_words = df_idf[df_idf['idf_weight'] <= limit ]['word'].values.tolist()`

and uncomment `#stop_words = []`
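
The selection step described above can be sketched without any dependencies. This is only an illustration under stated assumptions: it uses the smoothed IDF formula that `TfidfVectorizer` applies by default, and a nearest-rank quantile as the threshold (how the notebook builds `percent_df` is not shown here, so the exact threshold may differ). The helper names and the toy corpus are hypothetical.

```python
import math


def idf_weights(docs):
    """Smoothed IDF per word: ln((1 + n_docs) / (1 + doc_freq)) + 1."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        # Count each word once per document.
        for word in set(doc.split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w: math.log((1 + n_docs) / (1 + c)) + 1 for w, c in doc_freq.items()}


def low_idf_stop_words(docs, fraction):
    """Words whose IDF is at or below the `fraction` quantile (nearest rank)."""
    if fraction <= 0:
        return []  # the 0% case: no extra stop words
    idf = idf_weights(docs)
    ranked = sorted(idf.values())
    k = max(1, math.ceil(fraction * len(ranked)))
    limit = ranked[k - 1]
    # Same comparison as the notebook: keep every word with idf <= limit.
    return sorted(w for w, v in idf.items() if v <= limit)


# Hypothetical toy corpus; "el" appears in every document.
toy_corpus = [
    "el tribunal resuelve el recurso",
    "el juez rechaza el recurso",
    "el tribunal confirma la sentencia",
    "el fiscal apela la sentencia",
]

stop_words = low_idf_stop_words(toy_corpus, 0.05)
print(stop_words)
```

Because the comparison is `<= limit`, words tied at the threshold are all removed, so the share of vocabulary actually dropped can exceed the nominal percentage.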


After removing 10% of the words, the f1-score of Random Forest (the best-performing model for this particular problem) drops to about 70% (0.707842). We did not remove any more stop words because, by our criterion, pushing the f1-score below 70% is not acceptable.

The *SGDClassifier* classifier, which was suggested for this assignment and was not used in the previous one, ranked second.


# Brief introduction: Random Forest

In this section we **briefly** explain what the ML algorithm known as Random Forest is about.

## Why explain it?

Since we already used RandomForest in the previous assignment (it was the best model we obtained), and since this assignment involves adding an _ensemble method_, we thought it appropriate to include a brief introduction to it.

## Random Forest Algorithm

Its ease of use and versatility (it handles both classification and regression problems) have led to its rapid adoption. The algorithm combines multiple decision trees to create a "forest".

### Decision trees

An image helps explain this better:

<img src="images/decision-tree.png">

A decision tree is another algorithm used to classify data. In simple terms, it can be thought of as a map or flowchart that shows a clear path to a decision. It uses conditions to classify the data (questions, as in the example above). It starts at a root node from which the branches grow, each branch leading to a different outcome. The model looks for the best way to split the data, and trees are generally trained with the CART (Classification and Regression Tree) algorithm. One of its main advantages is ease of interpretation, since trees can be visualized. Several metrics measure the quality of a split, such as "gini" or "information gain" (in scikit-learn, "gini" and "entropy" respectively).
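
To make the split-quality metrics concrete, here is a minimal sketch of Gini impurity and entropy computed by hand. The function names and the tiny label lists are hypothetical (the class names are borrowed from this notebook's result columns); a real tree evaluates many candidate splits and keeps the one with the lowest weighted impurity.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_impurity(left, right, criterion=gini):
    """Impurity of a split: size-weighted average of both children."""
    n = len(left) + len(right)
    return (len(left) / n) * criterion(left) + (len(right) / n) * criterion(right)


parent = ["penal", "penal", "familia", "familia"]

# A split that separates the classes perfectly versus one that does not help.
pure_split = split_impurity(["penal", "penal"], ["familia", "familia"])
mixed_split = split_impurity(["penal", "familia"], ["penal", "familia"])

print(gini(parent), entropy(parent), pure_split, mixed_split)
```

The perfectly separating split brings the impurity down to zero, while the uninformative split leaves it equal to the parent's, which is why the tree would prefer the former.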

### So, how does Random Forest work?

By building multiple decision trees and then combining them to obtain better predictions. The logic behind it is simple: many models that are uncorrelated with each other (trained separately) perform much better as a group than individually.

When the algorithm is used for classification, each tree classifies the sample, or "casts a vote". The class with the most votes is then chosen.
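
The voting step can be sketched as follows. This is a simplified illustration of hard majority voting as described above, with hypothetical helper names and made-up tree outputs; note that scikit-learn's `RandomForestClassifier` instead averages the class probabilities predicted by the trees, which usually but not always agrees with a hard vote.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by most trees for a single sample.

    Ties are resolved in favor of the label that appears first.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


def forest_predict(per_tree_predictions):
    """Combine predictions where per_tree_predictions[t][i] is the label
    that tree t assigns to sample i."""
    return [majority_vote(votes) for votes in zip(*per_tree_predictions)]


# Hypothetical output of three trees for three documents.
per_tree = [
    ["penal", "familia", "laboral"],
    ["penal", "menores", "laboral"],
    ["familia", "familia", "penal"],
]

print(forest_predict(per_tree))
```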

<img src="images/random-forest.png">

### Discussion

RandomForest offers several key benefits:

- Reduced risk of overfitting: with a sufficiently large number of trees the model is unlikely to overfit, since averaging uncorrelated trees lowers the variance and the prediction error.
- Flexibility: it can be used for both classification and regression problems.
- Feature importance is easier to determine with this algorithm.
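
The variance argument in the first benefit can be made explicit with a standard result, stated here under the simplifying assumption that the $B$ trees have identically distributed predictions $T_b(x)$ with variance $\sigma^2$ and pairwise correlation $\rho$:

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

As $B$ grows the second term vanishes, so the remaining variance is governed by how correlated the trees are. This is why the less related the trees, the more the forest gains from averaging.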

It also presents some challenges:

- Training takes a long time on large data sets.
- More time also means that more computational resources are needed.
- Complexity: the prediction of a single tree is easier to interpret than the prediction of a "forest".
