## Context:
As a Fraud Analyst at a car insurance company, I need an efficient solution that uses machine learning to detect fraud in large volumes of data, in order to identify fraudulent activity more accurately and minimize the company's financial losses.

Description: As a Fraud Analyst, I want a Python application that uses machine learning techniques to analyze car insurance data, so that I can detect possible fraud efficiently and effectively.

# Acceptance Criteria:
# Data Import and Preparation:
The application must allow insurance data to be imported from CSV files, Excel files, or a database.
It must clean the data, including handling null values and inconsistencies.
It must transform categorical data into numeric formats usable by the machine learning model.
# Model Training:
The application must offer the option to train a machine learning model on a historical insurance dataset whose records were previously labeled as legitimate or fraudulent.
It must split the dataset into training and test sets, and allow the choice of machine learning algorithm (e.g., Random Forest, Gradient Boosting, CatBoost).
# Fraud Detection:
The application must be able to apply the trained model to predict the probability of fraud in new insurance data.
It must generate a list of records with their respective probabilities of being fraudulent, highlighting the highest-risk cases.
# Reports and Visualizations:
The application must provide detailed reports with model performance metrics (e.g., precision, recall, F1-score, AUC-ROC).
It must offer interactive visualizations, such as charts of the distribution of detected fraud, feature importance in the model, and comparisons between fraudulent and legitimate records.
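
The fraud-detection criterion (score new records and highlight the riskiest ones) could be sketched roughly as below. This is only an illustration under stated assumptions, not the notebook's implementation: the data is synthetic (`make_classification`), the `RandomForestClassifier` settings are arbitrary, and the `rank_by_fraud_probability` helper with its `top_n` parameter is hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def rank_by_fraud_probability(model, X_new: pd.DataFrame, top_n: int = 5) -> pd.DataFrame:
    """Hypothetical helper: return the top_n records with the highest fraud probability."""
    # Column 1 of predict_proba is the probability of the positive class (label 1 = fraud here).
    proba = model.predict_proba(X_new)[:, 1]
    ranked = X_new.assign(fraud_probability=proba)
    return ranked.sort_values("fraud_probability", ascending=False).head(top_n)


# Synthetic, imbalanced stand-in for historical claims (roughly 10% labeled as fraud).
X, y = make_classification(n_samples=500, n_features=6, weights=[0.9, 0.1], random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(6)])
X_train, X_new, y_train, _ = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
highest_risk = rank_by_fraud_probability(model, X_new, top_n=5)
print(highest_risk)
```

Sorting by probability rather than by the hard 0/1 prediction lets an analyst review cases in order of risk and choose a cutoff that fits the available review capacity.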


In [None]:

import copy
import pickle
import time
from typing import Any, List, Union

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.base import BaseEstimator, clone
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import (
    GridSearchCV,
    GroupKFold,
    ParameterSampler,
    RandomizedSearchCV,
    StratifiedKFold,
    train_test_split,
)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler, Normalizer, RobustScaler, StandardScaler

### Dataset

In [None]:
class Dataset:
    """
    A class representing a dataset.

    Parameters:
    - df (pandas.DataFrame): The input dataframe.
    - cat_columns (list): A list of categorical column names.
    - date_columns (list): A list of date column names.
    - cont_columns (list): A list of continuous column names.
    - disc_column (list): A list of discrete column names.
    - label_column (str or list, optional): The label column name(s). Defaults to None.

    Attributes:
    - _df (pandas.DataFrame): The input dataframe.
    - cat_columns (list): A list of categorical column names.
    - date_columns (list): A list of date column names.
    - cont_columns (list): A list of continuous column names.
    - disc_columns (list): A list of discrete column names.
    - labeled (bool): Indicates if the dataset is labeled.
    - label_column (str or list): The label column name(s).
    
    Methods:
    - check_columns(all_columns, *columns_groups): Checks if the given columns match the dataset columns.
    - not_label_columns: Property returning the feature column names (label and date columns excluded).
    - set_sample(sample, type, name, subtype=None): Sets a sample for a specific type and name.
    - get_sample(type, name, subtype=None): Retrieves a sample for a specific type and name.

    """

    def __init__(self, df, cat_columns, date_columns, cont_columns, disc_column, label_column=None):
        self._df = df
        self.cat_columns = cat_columns
        self.date_columns = date_columns
        self.cont_columns = cont_columns
        self.disc_columns = disc_column
        
        if isinstance(label_column, str):
            label_column = [label_column]
            
        if label_column is not None:
            self.labeled = True
            self.label_column = label_column
        else:
            self.labeled = False
            self.label_column = []
        
    def check_columns(self, all_columns, *columns_groups):
        """
        Checks if the given columns match the dataset columns.

        Parameters:
        - all_columns (list): A list of all column names.
        - *columns_groups (list): Variable number of lists containing column names.

        Raises:
        - ValueError: If the columns do not match.

        """
        columns = []
        for group in columns_groups:
            if group:
                columns += group
        # Raise only when the declared groups and the actual columns differ.
        mismatched = set(all_columns).symmetric_difference(columns)
        if mismatched:
            raise ValueError(f"Columns do not match: {mismatched}")
        
    @property
    def not_label_columns(self):
        """
        Returns the feature column names, excluding the label and date columns.

        A property cannot receive arguments, so date columns are never
        included here; use `date_columns` to access them.

        Returns:
        - list: A list of column names.

        """
        return self.cat_columns + self.cont_columns + self.disc_columns

    @property
    def df(self):
        """
        Returns the input dataframe.

        Returns:
        - pandas.DataFrame: The input dataframe.

        """
        return self._df
    
    @df.setter
    def df(self, df):
        """
        Sets the input dataframe.

        Parameters:
        - df (pandas.DataFrame): The input dataframe.

        """
        self._df = df

    def set_sample(self, sample, type, name, subtype=None):
        """
        Sets a sample for a specific type and name.

        Parameters:
        - sample: The sample to be set.
        - type (str): The type of the sample (train, validation, or test).
        - name (str): The name of the sample (X or Y).
        - subtype (str, optional): The subtype of the sample (scaled or None). Defaults to None.

        Raises:
        - ValueError: If the type, name, or subtype is invalid.

        """
        if type not in ["train", "validation", "test"]:
            raise ValueError("Type must be train, validation, or test")
        if name not in ["X", "Y"]:
            raise ValueError("Name must be X or Y")
        if subtype not in ["scaled", None]:
            raise ValueError("Subtype must be scaled or none")
        
        att_name = name + "_" + type if subtype is None else name + "_" + type + "_" + subtype

        setattr(self, att_name, sample)
    
    def get_sample(self, type, name, subtype=None):
        """
        Retrieves a sample for a specific type and name.

        Parameters:
        - type (str): The type of the sample (train, validation, or test).
        - name (str): The name of the sample (X or Y).
        - subtype (str, optional): The subtype of the sample (scaled or None). Defaults to None.

        Returns:
        - The requested sample.

        Raises:
        - ValueError: If the type, name, or subtype is invalid.

        """
        if type not in ["train", "validation", "test"]:
            raise ValueError("Type must be train, validation, or test")
        if name not in ["X", "Y"]:
            raise ValueError("Name must be X or Y")
        if subtype not in ["scaled", None]:
            raise ValueError("Subtype must be scaled or none")

        att_name = name + "_" + type if subtype is None else name + "_" + type + "_" + subtype

        return getattr(self, att_name)
    

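
`set_sample` and `get_sample` store each split as a plain attribute named `<name>_<type>` or `<name>_<type>_<subtype>` (for example `X_train` or `X_train_scaled`). The sketch below isolates just that naming rule; `SampleStore` and `attribute_name` are hypothetical names used for illustration and are not part of the notebook.

```python
class SampleStore:
    """Hypothetical stand-in that reproduces only the attribute-naming rule of Dataset."""

    VALID_TYPES = ("train", "validation", "test")
    VALID_NAMES = ("X", "Y")
    VALID_SUBTYPES = ("scaled", None)

    @classmethod
    def attribute_name(cls, type, name, subtype=None):
        # Validate once, in a single place shared by the setter and the getter.
        if type not in cls.VALID_TYPES:
            raise ValueError("Type must be train, validation, or test")
        if name not in cls.VALID_NAMES:
            raise ValueError("Name must be X or Y")
        if subtype not in cls.VALID_SUBTYPES:
            raise ValueError("Subtype must be scaled or None")
        parts = [name, type] + ([subtype] if subtype else [])
        return "_".join(parts)

    def set_sample(self, sample, type, name, subtype=None):
        setattr(self, self.attribute_name(type, name, subtype), sample)

    def get_sample(self, type, name, subtype=None):
        # Raises AttributeError if the sample was never set.
        return getattr(self, self.attribute_name(type, name, subtype))


store = SampleStore()
store.set_sample([[0.1, 0.2]], "train", "X", subtype="scaled")
print(SampleStore.attribute_name("train", "X", "scaled"))  # X_train_scaled
print(store.get_sample("train", "X", "scaled"))            # [[0.1, 0.2]]
```

Because the samples are ordinary attributes, code such as `hasattr(dataset, "X_train")` can check whether a split has already been created.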



### Dataloader


In [None]:
class Loader:
    """
    A class that loads and processes a dataset.

    Parameters:
    path (str): The path to the CSV file containing the dataset.
    cat_columns (list): A list of column names that are categorical variables.
    date_columns (list): A list of column names that are date variables.
    cont_columns (list): A list of column names that are continuous variables.
    disc_columns (list): A list of column names that are discrete variables.
    label_columns (list, optional): A list of column names that are labels. Defaults to None.

    Attributes:
    _dataset (Dataset): An instance of the Dataset class that represents the loaded dataset.
    """

    def __init__(self, path, cat_columns, date_columns, cont_columns, disc_columns, label_columns=None):
        """
        Initializes a Loader object.

        Loads the dataset from the specified CSV file and creates a Dataset object.

        Parameters:
        path (str): The path to the CSV file containing the dataset.
        cat_columns (list): A list of column names that are categorical variables.
        date_columns (list): A list of column names that are date variables.
        cont_columns (list): A list of column names that are continuous variables.
        disc_columns (list): A list of column names that are discrete variables.
        label_columns (list, optional): A list of column names that are labels. Defaults to None.
        """
        df = pd.read_csv(path, dtype=str)
        self._dataset = Dataset(df, cat_columns, date_columns, cont_columns, disc_columns, label_columns)

    def rename_columns(self, columns_map):
        """
        Renames the columns of the dataset.

        Parameters:
        columns_map (dict): A dictionary mapping old column names to new column names.
        """
        self._dataset.df = self._dataset.df.rename(columns=columns_map)
        
    @property
    def dataset(self):
        """
        Returns the dataset.

        Returns:
        Dataset: An instance of the Dataset class representing the loaded dataset.
        """
        return self._dataset
    


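
The acceptance criteria mention CSV, Excel, and database sources, while `Loader` reads CSV only. One possible way to cover the other sources is sketched below. The `read_table` helper and its `query` and `connection` parameters are hypothetical; `pd.read_excel` additionally needs an Excel engine such as `openpyxl` installed, and the SQLite table exists only for this demonstration.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd


def read_table(source=None, query=None, connection=None) -> pd.DataFrame:
    """Hypothetical helper: load raw records from a CSV file, an Excel file, or a SQL query."""
    if query is not None:
        # Database source: `connection` is any connection object that pandas accepts.
        return pd.read_sql_query(query, connection)
    suffix = Path(source).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(source, dtype=str)  # same raw-string convention as Loader
    if suffix in (".xlsx", ".xls"):
        return pd.read_excel(source, dtype=str)  # requires an Excel engine, e.g. openpyxl
    raise ValueError(f"Unsupported file type: {suffix!r}")


# CSV source written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "claims.csv"
    csv_path.write_text("policy_id,claim_amount\n1,1200.50\n2,87.00\n")
    from_csv = read_table(csv_path)

# Database source, using an in-memory SQLite database.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE claims (policy_id INTEGER, claim_amount REAL)")
connection.executemany("INSERT INTO claims VALUES (?, ?)", [(1, 1200.5), (2, 87.0)])
from_db = read_table(query="SELECT * FROM claims", connection=connection)
connection.close()

print(from_csv.dtypes)
print(from_db.dtypes)
```

Note that the file branches return every column as a string, matching `Loader`, whereas the database branch returns whatever types the driver reports, so the type-conversion step would need to tolerate both.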
        


### Preprocessing

In [None]:


class Preprocessor:
    """
    A class that provides preprocessing methods for a given dataset.

    Args:
        dataset (Dataset): The dataset object containing the data to be preprocessed.

    Attributes:
        dataset (Dataset): The dataset object containing the data to be preprocessed.
        df (pandas.DataFrame): The DataFrame representation of the dataset.
        cat_columns (list): The list of categorical column names in the dataset.
        date_columns (list): The list of date column names in the dataset.
        cont_columns (list): The list of continuous column names in the dataset.
        disc_columns (list): The list of discrete column names in the dataset.

    Methods:
        _convert_cat(): Converts categorical columns to the 'category' data type.
        _convert_date(): Converts date columns to the 'datetime' data type.
        _convert_cont(): Converts continuous columns to the 'float' data type.
        _convert_discrete(): Converts discrete columns to the 'float' and then 'int' data type.
        fill_na(fill_value=""): Fills missing values in the dataset with the specified fill value.
        na_analysis(): Performs missing value analysis and returns the count of missing values for each column.
        drop_na(): Drops rows with missing values from the dataset.
        replacements(columns, replacements): Replaces substrings in the specified columns with the specified replacements.
        create_dummies(columns): Creates dummy variables for the specified categorical columns.
        create_date(day_column, month_column, year_column, date_column=None): Creates a new date column from the specified day, month, and year columns.
        process_date(): Processes date columns and extracts additional features such as weekday, day, month, and year.
        process_labels(): Converts the label column to categorical codes.
        cat_to_codes(columns=None): Converts categorical columns to categorical codes.
        set_types(): Sets the data types of the columns in the dataset.
        update_dataset(): Updates the dataset object with the preprocessed data.

    """

    def __init__(self, dataset: Dataset):
        self.dataset = dataset
        self.df = dataset.df
        self.cat_columns = dataset.cat_columns
        self.date_columns = dataset.date_columns
        self.cont_columns = dataset.cont_columns
        self.disc_columns = dataset.disc_columns

    def _convert_cat(self):
        # Assign through [] rather than .loc so the 'category' dtype is kept.
        self.df[self.cat_columns] = self.df[self.cat_columns].astype('category')

    def _convert_date(self):
        for col in self.date_columns:
            self.df[col] = pd.to_datetime(self.df[col], errors='coerce')

    def _convert_cont(self):
        self.df[self.cont_columns] = self.df[self.cont_columns].astype(float)

    def _convert_discrete(self):
        self.df[self.disc_columns] = self.df[self.disc_columns].astype(float).astype(int)

    def fill_na(self, fill_value=""):
        self.df = self.df.fillna(fill_value)

    def na_analysis(self):
        return self.df.isna().sum()

    def drop_na(self):
        self.df = self.df.dropna()

    def replacements(self, columns, replacements):
        # Normalize `columns` first so a single column name is not measured by its string length.
        if isinstance(columns, str):
            columns = [columns]
        if isinstance(replacements, str):
            replacements = [replacements] * len(columns)
        if len(columns) != len(replacements):
            raise ValueError("Columns and replacements must have the same length")

        for column, replacement in zip(columns, replacements):
            # Remove the substring literally (no regex interpretation).
            self.df[column] = self.df[column].str.replace(replacement, "", regex=False)

    def create_dummies(self, columns):
        new_columns = []
        for column in columns:
            new_columns += list(map(lambda x: column + "_" + x, self.df[column].unique().tolist()))
        self.df = pd.get_dummies(self.df, columns=columns)
        for column in columns:
            if column in self.cat_columns:
                self.cat_columns.remove(column)
        self.cat_columns += new_columns

    def create_date(self, day_column, month_column, year_column, date_column=None):
        if date_column is None:
            date_column = "data_" + day_column.split("_")[-1]
        # The string is assembled as day-month-year, so parse it with an explicit
        # format; otherwise ambiguous values such as 5-3-2020 may be read month-first.
        date_strings = (self.df[day_column].astype(float).astype(int).astype(str)
                        + "-" + self.df[month_column].astype(float).astype(int).astype(str)
                        + "-" + self.df[year_column].astype(float).astype(int).astype(str))
        self.df[date_column] = pd.to_datetime(date_strings, format="%d-%m-%Y", errors='coerce')
        self.date_columns.append(date_column)

    def process_date(self):
        for name, func in zip(["weekday", "day", "month", "year"],
                              [lambda x: x.dt.weekday, lambda x: x.dt.day, lambda x: x.dt.month, lambda x: x.dt.year]):
            columns = list(map(lambda x: x + "_" + name, self.date_columns))
            self.df[columns] = self.df[self.date_columns].apply(func)
            self.disc_columns += columns

    def process_labels(self):
        self.df[self.dataset.label_column] = self.df[self.dataset.label_column].apply(
            lambda x: x.astype('category').cat.codes)

    def cat_to_codes(self, columns=None):
        if columns is None:
            columns = self.cat_columns
        print(columns)
        self.df[columns] = self.df[columns].apply(lambda x: x.astype('category').cat.codes)

    def set_types(self):
        # self._convert_cat()
        self._convert_date()
        self._convert_cont()
        self._convert_discrete()

    def update_dataset(self):
        self.dataset.df = self.df
        self.dataset.cat_columns = self.cat_columns
        self.dataset.date_columns = self.date_columns
        self.dataset.cont_columns = self.cont_columns
        self.dataset.disc_columns = self.disc_columns

        
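
`Preprocessor` offers two ways to turn categorical columns into numbers: integer codes (`cat_to_codes`) and one-hot dummy columns (`create_dummies`). The small sketch below shows the difference on a made-up frame; the `vehicle_type` and `claim_amount` columns are assumptions chosen only for illustration.

```python
import pandas as pd

claims = pd.DataFrame({
    "vehicle_type": ["sedan", "suv", "sedan", "truck"],
    "claim_amount": [1200.5, 87.0, 430.0, 990.0],
})

# Integer codes: a single column, with categories numbered in sorted order.
codes = claims["vehicle_type"].astype("category").cat.codes

# One-hot (dummy) columns: one indicator column per category, named <column>_<value>.
dummies = pd.get_dummies(claims, columns=["vehicle_type"])

print(codes.tolist())            # [0, 1, 0, 2]
print(dummies.columns.tolist())  # ['claim_amount', 'vehicle_type_sedan', 'vehicle_type_suv', 'vehicle_type_truck']
```

Integer codes impose an arbitrary ordering on the categories, which tree-based models usually tolerate. Linear or distance-based models such as `LogisticRegression` and `KNeighborsClassifier` tend to work better with dummy columns.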


### Machine learning preprocessing

In [None]:


class MLPreprocessing():
    """
    Class for performing preprocessing tasks on machine learning datasets.

    Attributes:
        dataset (Dataset): The dataset object containing the data.
        scaler (object): The scaler object used for feature scaling.

    Methods:
        undersample: Undersamples the dataset to balance the classes.
        set_scaler: Sets the scaler object for feature scaling.
        scale: Scales the features in the dataset.
        split: Splits the dataset into train, validation, and test sets.
        cross_validation: Performs cross-validation on the dataset.
    """

    @staticmethod
    def _split(X, Y, **kwargs):
        """
        Helper function to split the dataset into train and test sets.

        Args:
            X (DataFrame): The input features.
            Y (Series): The target variable.
            **kwargs: Additional arguments to pass to the train_test_split function.

        Returns:
            X_train (DataFrame): The training set input features.
            X_test (DataFrame): The test set input features.
            Y_train (Series): The training set target variable.
            Y_test (Series): The test set target variable.
        """
        X_train, X_test, Y_train, Y_test = train_test_split(X, Y, **kwargs)
        return X_train, X_test, Y_train, Y_test

    def __init__(self, dataset):
        """
        Initializes the MLPreprocessing object.

        Args:
            dataset (Dataset): The dataset object containing the data.
        """
        self.dataset = dataset
        self.dataset.scalable_columns = self.dataset.cont_columns + self.dataset.disc_columns
        self.dataset.not_scalable_columns = self.dataset.cat_columns

    def undersample(self):
        """
        Undersamples the dataset to balance the classes.
        """
        positive_sample_size = (self.dataset.df[self.dataset.label_column[0]] == 1).sum()
        self.dataset.__setattr__("full_df", self.dataset.df)
        self.dataset.df = pd.concat([
            self.dataset.df[self.dataset.df[self.dataset.label_column[0]] == 0].sample(positive_sample_size),
            self.dataset.df[self.dataset.df[self.dataset.label_column[0]] == 1]
        ], axis=0)
        print(positive_sample_size)

    def set_scaler(self, type="minmax"):
        """
        Sets the scaler object for feature scaling.

        Args:
            type (str): The type of scaler to use. Default is "minmax".
        """
        scalers = {"minmax": MinMaxScaler, "standard": StandardScaler, "robust": RobustScaler,
                   "normalizer": Normalizer}
        self.scaler = scalers[type]()

    def scale(self, sample=["train", "validation", "test"]):
        """
        Scales the features in the dataset.

        Args:
            sample (list): The samples to scale. Default is ["train", "validation", "test"].

        Raises:
            ValueError: If the scaler was not set or the train data was not set.
        """
        if not hasattr(self, "scaler"):
            raise ValueError("Scaler was not set")
        if not hasattr(self.dataset, "X_train"):
            raise ValueError("Train data was not set")

        X_train_scaled = self.scaler.fit_transform(self.dataset.X_train[self.dataset.scalable_columns].astype(float))
        X_train_scaled = pd.concat([
            self.dataset.X_train[self.dataset.not_scalable_columns].reset_index(drop=True),
            pd.DataFrame(X_train_scaled, columns=self.dataset.scalable_columns)
        ], axis=1)
        self.dataset.set_sample(X_train_scaled, "train", "X", subtype="scaled")

        if "test" in sample and hasattr(self.dataset, "X_test"):
            X_test_scaled = self.scaler.transform(self.dataset.X_test[self.dataset.scalable_columns])
            X_test_scaled = pd.concat([
                self.dataset.X_test[self.dataset.not_scalable_columns].reset_index(drop=True),
                pd.DataFrame(X_test_scaled, columns=self.dataset.scalable_columns)
            ], axis=1)
            self.dataset.set_sample(X_test_scaled, "test", "X", subtype="scaled")

        if "validation" in sample and hasattr(self.dataset, "X_validation"):
            X_val_scaled = self.scaler.transform(self.dataset.X_validation[self.dataset.scalable_columns])
            X_val_scaled = pd.concat([
                self.dataset.X_validation[self.dataset.not_scalable_columns].reset_index(drop=True),
                pd.DataFrame(X_val_scaled, columns=self.dataset.scalable_columns)
            ], axis=1)
            self.dataset.set_sample(X_val_scaled, "validation", "X", subtype="scaled")

    def split(self, validation=True, stratified=True, columns_stratify=None):
        """
        Splits the dataset into train, validation, and test sets.

        Uses 70/15/15 proportions with a validation set and 80/20 without one.

        Args:
            validation (bool): Whether to include a validation set. Default is True.
            stratified (bool): Whether to perform stratified sampling. Default is True.
            columns_stratify (list): Extra columns to stratify on, in addition to the
                label column(s). Default is None (stratify on the label only).

        Raises:
            ValueError: If the dataset is not labeled.
        """
        if not self.dataset.labeled:
            raise ValueError("Dataset is not labeled")

        validation_size = 0.15 if validation else 0.0
        test_size = 0.15 if validation else 0.20

        # train_test_split expects per-row values for `stratify`, not column names.
        # Build a new list so the caller's `columns_stratify` is not mutated.
        strata = None
        if stratified:
            strata = self.dataset.df[list(columns_stratify or []) + self.dataset.label_column]

        X_train, X_test, Y_train, Y_test = MLPreprocessing._split(
            self.dataset.df[self.dataset.not_label_columns],
            self.dataset.df[self.dataset.label_column],
            random_state=42,
            test_size=test_size + validation_size,
            stratify=strata
        )
        if validation:
            X_test, X_val, Y_test, Y_val = MLPreprocessing._split(
                X_test, Y_test, random_state=42,
                test_size=validation_size / (test_size + validation_size),
                # Re-align the strata with the rows held out by the first split
                # (assumes the dataframe index is unique).
                stratify=None if strata is None else strata.loc[X_test.index]
            )
        else:
            X_val, Y_val = None, None

        self.dataset.set_sample(X_train, "train", "X")
        self.dataset.set_sample(Y_train, "train", "Y")
        self.dataset.set_sample(X_val, "validation", "X")
        self.dataset.set_sample(Y_val, "validation", "Y")
        self.dataset.set_sample(X_test, "test", "X")
        self.dataset.set_sample(Y_test, "test", "Y")

    def cross_validation(self):
        """
        Placeholder method for performing cross-validation on the dataset.
        """
        pass

class MLPreprocessing():


    def _split(X,Y,**kwargs):

        X_train, X_test, Y_train, Y_test=train_test_split(X,Y,**kwargs)

        return X_train, X_test, Y_train, Y_test
   

    def __init__(self,dataset:Dataset) -> None:
        self.dataset=dataset
        self.dataset.__setattr__("scalable_columns",self.dataset.cont_columns+self.dataset.disc_columns)
        self.dataset.__setattr__("not_scalable_columns",self.dataset.cat_columns)

    def undersample(self):
        positive_sample_size=(self.dataset.df[self.dataset.label_column[0]]==1).sum()
        self.dataset.__setattr__("full_df",self.dataset.df)
        self.dataset.df=pd.concat([self.dataset.df[self.dataset.df[self.dataset.label_column[0]]==0].sample(positive_sample_size),
                                   self.dataset.df[self.dataset.df[self.dataset.label_column[0]]==1]],axis=0)
        print(positive_sample_size)


    def set_scaler(self,type="minmax"):
        scalers={"minmax":MinMaxScaler,"standard":StandardScaler,"robust":RobustScaler,"normalizer":RobustScaler}
        self.scaler=scalers[type]()

    def scale(self,sample=["train","validation","test"]):

        if not hasattr(self,"scaler"):
            raise ValueError("Scaler was not set")
        if not hasattr(self.dataset,"X_train"):
            raise ValueError("Train data was not set")        

        X_train_scaled=self.scaler.fit_transform(self.dataset.X_train[self.dataset.scalable_columns].astype(float))
        X_train_scaled=pd.concat([self.dataset.X_train[self.dataset.not_scalable_columns].reset_index(drop=True),
                                  pd.DataFrame(X_train_scaled,columns=self.dataset.scalable_columns)],axis=1)
        self.dataset.set_sample(X_train_scaled,"train","X",subtype="scaled")

        if "test" in sample and hasattr(self.dataset,"X_test"):
            X_test_scaled=self.scaler.transform(self.dataset.X_test[self.dataset.scalable_columns])
            X_test_scaled=pd.concat([self.dataset.X_test[self.dataset.not_scalable_columns].reset_index(drop=True),
                                     pd.DataFrame(X_test_scaled,columns=self.dataset.scalable_columns)],axis=1)
            self.dataset.set_sample(X_test_scaled,"test","X",subtype="scaled")

        if "validation" in sample and hasattr(self.dataset,"X_validation"):
            X_val_scaled=self.scaler.transform(self.dataset.X_validation[self.dataset.scalable_columns])
            X_val_scaled=pd.concat([self.dataset.X_validation[self.dataset.not_scalable_columns].reset_index(drop=True),
                                    pd.DataFrame(X_val_scaled,columns=self.dataset.scalable_columns)],axis=1)
            self.dataset.set_sample(X_val_scaled,"validation","X",subtype="scaled")
    

    def split(self,validation=True,stratified=True,columns_stratify=None):

         
        train_size=0.7 if validation else 0.80
        validation_size=0.15
        test_size=0.15 if validation else 0.20
        if not self.dataset.labeled:
            raise NotImplementedError("split requires a labeled dataset")
        
        if stratified and columns_stratify is not None:
            # Build a new list so the caller's list is not modified in place.
            columns_stratify=list(columns_stratify)+self.dataset.label_column

              
        X_train, X_test, Y_train, Y_test=MLPreprocessing._split(self.dataset.df[self.dataset.not_label_columns],self.dataset.df[self.dataset.label_column],random_state=42,test_size=test_size+validation_size,stratify=columns_stratify)        
        if validation:
                X_test, X_val, Y_test, Y_val=MLPreprocessing._split(X_test,Y_test,random_state=42,test_size=validation_size/(test_size+validation_size),stratify=columns_stratify)
        else:
            X_val,Y_val=None,None
        
        self.dataset.set_sample(X_train,"train","X")
        self.dataset.set_sample(Y_train,"train","Y")
        self.dataset.set_sample(X_val,"validation","X")
        self.dataset.set_sample(Y_val,"validation","Y")
        self.dataset.set_sample(X_test,"test","X")
        self.dataset.set_sample(Y_test,"test","Y")



    def cross_validation(self):
        pass
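
The 70/15/15 stratified split and the fit-on-train-only scaling performed by `split` and `scale` can be sketched on synthetic data. This is a minimal illustration under stated assumptions, not the class above: the helper name `split_and_scale`, the column names, and the 10% fraud rate are all made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler


def split_and_scale(X, y, random_state=42):
    """Split 70/15/15 with stratification, then scale with training statistics only."""
    # First cut: 70% train, 30% held out.
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=random_state
    )
    # Second cut: halve the held-out part into test and validation.
    X_test, X_val, y_test, y_val = train_test_split(
        X_hold, y_hold, test_size=0.50, stratify=y_hold, random_state=random_state
    )
    # The scaler only ever sees the training rows, which avoids leakage.
    scaler = MinMaxScaler().fit(X_train)

    def scaled(frame):
        return pd.DataFrame(scaler.transform(frame), columns=frame.columns, index=frame.index)

    return (scaled(X_train), y_train), (scaled(X_val), y_val), (scaled(X_test), y_test)


rng = np.random.default_rng(0)
X = pd.DataFrame({
    "claim_amount": rng.normal(5000, 1500, 200),  # hypothetical continuous feature
    "vehicle_age": rng.integers(0, 20, 200),      # hypothetical discrete feature
})
y = pd.Series([1] * 20 + [0] * 180, name="fraud")  # synthetic labels, 10% positives

(train_X, train_y), (val_X, val_y), (test_X, test_y) = split_and_scale(X, y)
print(len(train_X), len(val_X), len(test_X))
```

Validation and test values can fall slightly outside [0, 1] because the minimum and maximum come from the training rows alone.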


### Model Class

In [None]:
class Model(BaseEstimator):
    """
    A class representing a machine learning model.

    Parameters:
    - model: The machine learning model to be used.
    - params: The parameters for the model.
    - supervised: A boolean indicating whether the model is supervised or not.
    - run_scaled: A boolean indicating whether to run the model on scaled data.
    - run_on_categorical: A boolean indicating whether to run the model on categorical columns.
    - run_on_continues: A boolean indicating whether to run the model on continuous columns.
    - created: A boolean indicating whether the model has been created or not.

    Methods:
    - camel_case_split: Splits a camel case string into separate words.
    - __init__: Initializes the Model object.
    - get_X: Retrieves the X data for a given type and dataset.
    - set_params: Sets the parameters for the model.
    - grid_search_params: Sets the parameters for grid search.
    - random_grid_search_params: Sets the parameters for random grid search.
    - model: Returns the model object.
    - set_metrics: Sets the metrics for the model.
    - show_metrics: Returns the metrics for the model.
    - fit: Fits the model to the data.
    - __str__: Returns a string representation of the model.
    - __call__: Makes predictions using the model.

    Attributes:
    - _model: The machine learning model.
    - params: The parameters for the model.
    - supervised: A boolean indicating whether the model is supervised or not.
    - created: A boolean indicating whether the model has been created or not.
    - run_scaled: A boolean indicating whether to run the model on scaled data.
    - run_on_categorical: A boolean indicating whether to run the model on categorical columns.
    - run_on_continues: A boolean indicating whether to run the model on continuous columns.
    - grid_search_params: The parameters for grid search.
    - metrics: The metrics for the model.
    """

    @staticmethod
    def camel_case_split(text):
        """
        Splits a camel case string into separate words.

        Parameters:
        - text: The camel case string to be split.

        Returns:
        - A list of words.
        """
        # Word boundaries: position 0 plus every later uppercase letter.
        # Skipping index 0 avoids an empty first word when the string starts uppercase.
        start_idx = [0] + [i for i, e in enumerate(text)
                           if e.isupper() and i > 0] + [len(text)]
        return [text[x: y] for x, y in zip(start_idx, start_idx[1:])]
    
    
    def __init__(self,model,params,supervised:bool,run_scaled:bool=False,run_on_categorical=True,run_on_continues=True,created=False) -> None:
        """
        Initializes the Model object.

        Parameters:
        - model: The machine learning model to be used.
        - params: The parameters for the model.
        - supervised: A boolean indicating whether the model is supervised or not.
        - run_scaled: A boolean indicating whether to run the model on scaled data.
        - run_on_categorical: A boolean indicating whether to run the model on categorical columns.
        - run_on_continues: A boolean indicating whether to run the model on continuous columns.
        - created: A boolean indicating whether the model has been created or not.
        """
        self._model=model
        self.params=params
        self.supervised=supervised
        if not created:
            self.set_params()
        else:
            self.created=created
        self.run_scaled=run_scaled
        self.run_on_categorical=run_on_categorical
        self.run_on_continues=run_on_continues
        if not (run_on_categorical or run_on_continues):
            raise ValueError("Model must run on categorical and/or continuous columns")
        
    def get_X(self,type:Union["train","test","validation"],dataset:Dataset):       
        """
        Retrieves the X data for a given type and dataset.

        Parameters:
        - type: The type of data to retrieve (train, test, or validation).
        - dataset: The dataset object.

        Returns:
        - The X data.
        """
        subtype="scaled" if self.run_scaled else None
        columns=dataset.not_label_columns
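        # List comprehensions keep the original column order (a set difference does not).
        if not self.run_on_categorical:
            columns=[column for column in columns if column not in dataset.cat_columns]
        if not self.run_on_continues:
            columns=[column for column in columns if column not in dataset.cont_columns]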
            
        return dataset.get_sample(type,"X",subtype=subtype)[columns]
    
    def set_params(self):
        """
        Sets the parameters for the model.
        """
        self.created=True
        self._model=self._model(**self.params)
    
    def grid_search_params(self,**params):
        """
        Sets the parameters for grid search.

        Parameters:
        - params: The parameters for grid search.
        """
        self.grid_search_params=params

    def random_grid_search_params(self,**params):
        """
        Sets the parameters for random grid search.

        Parameters:
        - params: The parameters for random grid search.
        """
        self.grid_search_params=params

    @property
    def model(self):
        """
        Returns the model object.

        Raises:
        - ValueError: If the model has not been set.
        """
        if not hasattr(self,"created"):
            raise ValueError("Model was not set")
        
        return self._model        

    def set_metrics(self,metric,value):
        """
        Sets the metrics for the model.

        Parameters:
        - metric: The metric name.
        - value: The metric value.
        """
        if not hasattr(self,"metrics"):
            self.metrics={}
        self.metrics[metric]=value

    def show_metrics(self):
        """
        Returns the metrics for the model.
        """
        return self.metrics
    
    def fit(self,Y,X=None,type=None,dataset:Dataset=None,**kwargs):
        """
        Fits the model to the data.

        Parameters:
        - Y: The target variable.
        - X: The input data.
        - type: The type of data (train, test, or validation).
        - dataset: The dataset object.
        - kwargs: Additional keyword arguments for the fit method.

        Returns:
        - The fitted model.
        """
        if X is None and type is not None and dataset is not None:
            X=self.get_X(type,dataset)
            
        return self._model.fit(X,Y,**kwargs)

    def __str__(self) -> str:
        """
        Returns a string representation of the model.
        """
        name=(self.model.__str__()).split("(")[0]
        if self.run_scaled:
            name+="_scaled"
        if self.run_on_categorical:
            name+="_cat"
        if self.run_on_continues:
            name+="_cont"
        return name
            
    def __call__(self,X=None,type=None,dataset:Dataset=None):
        """
        Makes predictions using the model.

        Parameters:
        - X: The input data.
        - type: The type of data (train, test, or validation).
        - dataset: The dataset object.

        Returns:
        - The predictions.
        """
        if type is not None and dataset is not None:
            X=self.get_X(type,dataset)
        if X is None:
            raise ValueError("X or type and Dataset must be passed")
        
        return self._model.predict(X)
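
`__call__` returns hard class labels. To rank records by fraud risk, one option is to read the positive-class column of `predict_proba` from the wrapped estimator. The sketch below is an assumption-laden illustration on synthetic data: it uses a plain scikit-learn `RandomForestClassifier` rather than the `Model` wrapper, and the feature names are invented.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced data standing in for insurance records.
X, y = make_classification(n_samples=300, n_features=6, weights=[0.9, 0.1], random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(6)])

# Same pattern as Model.set_params: an estimator class plus a params dict.
estimator_class, params = RandomForestClassifier, {"n_estimators": 50, "random_state": 0}
estimator = estimator_class(**params)
estimator.fit(X, y)

# Column 1 of predict_proba is the probability of the positive (fraud) class.
fraud_probability = pd.Series(
    estimator.predict_proba(X)[:, 1], index=X.index, name="fraud_probability"
)
highest_risk = fraud_probability.sort_values(ascending=False).head(5)
print(highest_risk)
```

Scoring the training rows, as done here, only demonstrates the call pattern; a real risk list would score new, unseen records.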



### Model Trainer

In [None]:
class Trainer:
    """
    The Trainer class is responsible for training machine learning models.

    Args:
        models (List[BaseEstimator]): A list of machine learning models to be trained.
        dataset (Dataset): The dataset used for training and evaluation.

    Attributes:
        models (List[BaseEstimator]): A list of machine learning models.
        dataset (Dataset): The dataset used for training and evaluation.

    Methods:
        _train(model: Model, **kwargs): Trains a single machine learning model.
        train(**kwargs): Trains all models or a specific model.
        run_grid_search(random_state=123, n_iter=10): Runs grid search for models with grid search parameters.
        run_evaluation(): Evaluates all models.
        add(model: Model, train=True): Adds a new model to the list of models.

    """

    def __init__(self, models: List[BaseEstimator], dataset: Dataset) -> None:
        if not isinstance(models, list):
            models = [models]
            
        self.models = models
        self.dataset = dataset

    def _train(self, model: Model, **kwargs):
        """
        Trains a single machine learning model.

        Args:
            model (Model): The machine learning model to be trained.
            **kwargs: Additional keyword arguments to be passed to the model's fit method.

        """
        t1 = time.time()
        if model.supervised:
            model.fit(type="train", dataset=self.dataset, Y=self.dataset.Y_train, **kwargs)
        else:
            # Unsupervised estimators take no target, so Y is passed explicitly as None.
            model.fit(Y=None, type="train", dataset=self.dataset, **kwargs)
             
        t2 = time.time()
        model.__setattr__("training_time", t2 - t1)

    def train(self, **kwargs):
        """
        Trains all models or a specific model.

        Args:
            **kwargs: Additional keyword arguments to be passed to the _train method.
                If "model" is provided, only that specific model will be trained.

        """
        if "model" not in kwargs:            
            for model in self.models:                    
                self._train(model, **kwargs)
        else:
            model = kwargs.pop("model")
            self._train(model, **kwargs)


    def run_grid_search(self, random_state=123, n_iter=10):
        """
        Trains sampled hyperparameter candidates for every model that has a search space.

        Args:
            random_state (int): Seed used to sample the candidates. Default is 123.
            n_iter (int): Number of candidates sampled per model. Default is 10.

        """
        np.random.seed(random_state)

        for model in self.models:
            # Model.grid_search_params is a method until it is called with a search space,
            # so only a dict counts as a configured search space.
            search_space = getattr(model, "grid_search_params", None)
            if not isinstance(search_space, dict):
                continue
            parameters = ParameterSampler(search_space, n_iter=n_iter, random_state=random_state)
            for param in parameters:
                # Train a copy so the original model keeps its own parameters.
                candidate = copy.deepcopy(model)
                candidate.model.set_params(**param)
                self._train(candidate)

    def run_evaluation(self):
        """
        Evaluates all models. Expects every model to provide an evaluate() method.

        """
        for model in self.models:
            model.evaluate()

    def add(self, model: Model, train=True):
        """
        Adds a new model to the list of models.

        Args:
            model (Model): The machine learning model to be added.
            train (bool): Whether to train the model after adding it. Default is True.

        """
        self.models.append(model)
        if train:
            self.train(model=model)
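
The sampling step behind `run_grid_search` can be sketched with scikit-learn's `ParameterSampler`: each draw configures a fresh clone of a base estimator, which is then trained. The estimator, the search space, and the synthetic data are assumptions chosen for the example, and the score is training accuracy, used here only to show that every candidate was fitted.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterSampler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
base = LogisticRegression(max_iter=500)

# Illustrative search space: 4 x 2 = 8 possible combinations.
search_space = {"C": [0.01, 0.1, 1.0, 10.0], "fit_intercept": [True, False]}

candidates = []
for sampled in ParameterSampler(search_space, n_iter=4, random_state=123):
    # clone() returns an unfitted copy, so candidates never share state.
    candidate = clone(base).set_params(**sampled)
    candidate.fit(X, y)
    candidates.append((sampled, candidate.score(X, y)))

for sampled, score in candidates:
    print(sampled, round(score, 3))
```

When every entry of the search space is a list, `ParameterSampler` draws without replacement, so the four candidates are distinct.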


### Model Evaluation

In [None]:

class Evaluation():
    """
    Class to perform evaluation of models on a dataset.

    Args:
        models (List[Model]): A list of models to evaluate.
        dataset (Dataset): The dataset to evaluate the models on.

    Attributes:
        dataset (Dataset): The dataset used for evaluation.
        models (List[Model]): The models to evaluate.

    Methods:
        run(sample="validation"): Runs the evaluation on the specified sample.
        metrics(): Returns the evaluation metrics for each model.
        plot_metrics(orient="model"): Plots the evaluation metrics for each model.

    """

    def __init__(self, models: List[Model], dataset: Dataset) -> None:
        if not isinstance(models, list):
            models = [models]
            
        self.dataset = dataset
        self.models = models


    def run(self, sample="validation"):
        """
        Runs the evaluation on the specified sample.

        Args:
            sample (str, optional): The sample to evaluate the models on. Defaults to "validation".

        """
        y_true = self.dataset.get_sample(sample, "Y")
        for model in self.models:
            # Predict once per model and reuse the predictions for every metric.
            y_pred = model(type=sample, dataset=self.dataset)

            model.set_metrics("accuracy", accuracy_score(y_true, y_pred))
            model.set_metrics("precision", precision_score(y_true, y_pred))
            model.set_metrics("recall", recall_score(y_true, y_pred))
            model.set_metrics("f1", f1_score(y_true, y_pred))
            # ROC AUC is computed from hard labels here, not from predicted probabilities.
            model.set_metrics("roc_auc", roc_auc_score(y_true, y_pred))


    def metrics(self):
        """
        Returns the evaluation metrics for each model.

        Returns:
            dict: A dictionary containing the evaluation metrics for each model.

        """
        metrics = {}
        for model in self.models:
            metrics[str(model)] = model.show_metrics()
        return metrics
    
    def plot_metrics(self, orient="model"):
        """
        Plots the evaluation metrics for each model.

        Args:
            orient (str, optional): The orientation of the plot. Defaults to "model".

        Returns:
            matplotlib.figure.Figure: The plotted figure.

        """
        if orient == "model":
            fig, axs = plt.subplot_mosaic([["accuracy", "precision", "recall"], ["f1", "roc_auc", "vazio"]], sharey=True, figsize=(10, 4))
            for metric in ["accuracy", "precision", "recall", "f1", "roc_auc"]:
                values = []
                for model in self.models:
                    values.append(model.__getattribute__("metrics")[metric])
                axs[metric].set_title(metric)
                sns.barplot(x=values, y=[str(model) for model in self.models], ax=axs[metric])
                axs[metric].set_xlim(0.5, 1)
            plt.tight_layout()
            return fig
        else:
            # Five models per row; squeeze=False keeps axs two-dimensional even with one row.
            n_rows = max(1, (len(self.models) + 4) // 5)
            fig, axs = plt.subplots(n_rows, 5, sharey=True, squeeze=False)

            for index, model in enumerate(self.models):
                values = []
                for metric in ["accuracy", "precision", "recall", "f1", "roc_auc"]:
                    values.append(model.__getattribute__("metrics")[metric])
                axs[index//5, index%5].set_title(str(model.model), fontsize=8)
                sns.barplot(x=values, y=["accuracy", "precision", "recall", "f1", "roc_auc"], ax=axs[index//5, index%5])
            plt.tight_layout()
            return fig
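
The five metrics used by `Evaluation.run` can be computed from a single prediction pass. The sketch below also feeds `roc_auc_score` the positive-class probability instead of hard labels, which is the usual input for that metric. The classifier and the synthetic data are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8, weights=[0.85, 0.15], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
clf = LogisticRegression(max_iter=500).fit(X_train, y_train)

y_pred = clf.predict(X_val)               # hard labels, computed once
y_score = clf.predict_proba(X_val)[:, 1]  # probability of the positive class

metrics = {
    "accuracy": accuracy_score(y_val, y_pred),
    "precision": precision_score(y_val, y_pred, zero_division=0),
    "recall": recall_score(y_val, y_pred, zero_division=0),
    "f1": f1_score(y_val, y_pred, zero_division=0),
    "roc_auc": roc_auc_score(y_val, y_score),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```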



### Selecting the Trained Models

In [None]:
class Selector():
    """
    A class that selects the best model based on a specified metric.
    
    Args:
        models (List[Model]): A list of models to select from.
        aimed_metric (str): The metric to optimize for. Defaults to "accuracy".
    
    Attributes:
        models (List[Model]): A list of models to select from.
        aimed_metric (str): The metric to optimize for.
    
    Methods:
        check_metric(): Checks if each model has the specified metric.
        select(): Selects the model with the highest value for the specified metric.
    """
    
    def __init__(self, models: List[Model], aimed_metric: str = "accuracy") -> None:
        """
        Initializes a Selector object.
        
        Args:
            models (List[Model]): A list of models to select from.
            aimed_metric (str): The metric to optimize for. Defaults to "accuracy".
        """
        if not isinstance(models, list):
            models = [models]
        
        self.models = models
        self.aimed_metric = aimed_metric
    
    def check_metric(self):
        """
        Checks if each model has the specified metric.
        
        Raises:
            ValueError: If a model does not have the specified metric.
        """
        for model in self.models:
            if not hasattr(model, "metrics"):
                raise ValueError(f"{str(model)} has no metrics")
            elif model.metrics.get(self.aimed_metric) is None:
                raise ValueError(f"{str(model)} has no metric '{self.aimed_metric}'")
            
    def select(self):
        """
        Selects the model with the highest value for the specified metric.
        
        Returns:
            Model: The selected model.
        """
        scores = []
        for model in self.models:
            scores.append(model.metrics.get(self.aimed_metric))
        
        argmax = scores.index(max(scores))

        return self.models[argmax]
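
The choice of `aimed_metric` matters on imbalanced fraud data. The sketch below uses made-up labels and two hypothetical classifiers: a model that never flags fraud wins on accuracy, while a model that catches most fraud wins on recall. All names and numbers are invented for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of records whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true fraud cases (label 1) that were flagged."""
    positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(positives) / len(positives)


# 10 fraudulent and 90 legitimate records (synthetic).
y_true = [1] * 10 + [0] * 90

predictions = {
    "always_legit": [0] * 100,                             # never flags anything
    "flags_some": [1] * 7 + [0] * 3 + [1] * 12 + [0] * 78, # 7 hits, 12 false alarms
}

scores = {
    name: {"accuracy": accuracy(y_true, y_pred), "recall": recall(y_true, y_pred)}
    for name, y_pred in predictions.items()
}

# Same argmax idea as Selector.select, applied to two different metrics.
best_by_accuracy = max(scores, key=lambda name: scores[name]["accuracy"])
best_by_recall = max(scores, key=lambda name: scores[name]["recall"])
print(best_by_accuracy, best_by_recall)
```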




### Hyperparameter Tuning

In [None]:
class ModelTunning:
    """
    The ModelTunning class is used for hyperparameter tuning of machine learning models.
    It provides methods for performing random search and grid search to find the best set of hyperparameters.

    Attributes:
        model (Model): The machine learning model to be tuned.
        dataset (Dataset): The dataset used for training the model.
        cv (int): The number of cross-validation folds.
        random_state (int): The random seed for reproducibility.
        scoring_fn (str): The scoring function used for evaluating the models.

    Methods:
        ideal_cutoff(size, cutoff, max_iter): Recursive function to calculate the ideal cutoff value.
        RandomSearch(n_iter): Performs random search to find the best set of hyperparameters.
        GridSearch(max_iter, amplitude, cutoff, params): Performs grid search to find the best set of hyperparameters.
        best_model: Returns the best model found during the search.

    Usage:
        # Create an instance of ModelTunning
        tuner = ModelTunning(model, dataset, cv=10, random_state=123, scoring_fn="accuracy")

        # Perform random search
        tuner.RandomSearch(n_iter=10)

        # Perform grid search
        tuner.GridSearch(max_iter=30, amplitude=0.5, cutoff=5, params=None)

        # Get the best model
        best_model = tuner.best_model
    """

    def __init__(self, model: Model, dataset: Dataset, cv=10, random_state=123, scoring_fn="accuracy") -> None:
        """
        Initializes a new instance of the ModelTunning class.

        Args:
            model (Model): The machine learning model to be tuned.
            dataset (Dataset): The dataset used for training the model.
            cv (int, optional): The number of cross-validation folds. Defaults to 10.
            random_state (int, optional): The random seed for reproducibility. Defaults to 123.
            scoring_fn (str, optional): The scoring function used for evaluating the models. Defaults to "accuracy".
        """
        self.model = model
        self.dataset = dataset
        self.random_state = random_state
        self.cv = cv
        self.scoring_fn = scoring_fn

    @staticmethod
    def ideal_cutoff(size, cutoff, max_iter):
        """
        Recursive function to calculate the ideal cutoff value.

        Args:
            size (int): The number of hyperparameters.
            cutoff (int): The current cutoff value.
            max_iter (int): The maximum number of iterations.

        Returns:
            int: The ideal cutoff value.
        """
        if size * cutoff > max_iter:
            cutoff -= 1
            return ModelTunning.ideal_cutoff(size, cutoff, max_iter)
        else:
            return cutoff

    def RandomSearch(self, n_iter=10):
        """
        Performs random search to find the best set of hyperparameters.

        Args:
            n_iter (int, optional): The number of iterations. Defaults to 10.

        Returns:
            dict: The results of the random search.
        """
        search = RandomizedSearchCV(
            self.model.model,
            self.model.grid_search_params,
            n_iter=n_iter,
            random_state=self.random_state,
            scoring=self.scoring_fn,
            cv=self.cv
        )

        search.fit(self.model.get_X(type="train", dataset=self.dataset), self.dataset.Y_train)
        self.random_search = search
        self.random_search_best_model = copy.deepcopy(self.model)
        self.random_search_best_model._model = clone(self.random_search.best_estimator_)
        return search.cv_results_

    def GridSearch(self, max_iter=30, amplitude=0.5, cutoff=5, params=None):
        """
        Performs grid search to find the best set of hyperparameters.

        Args:
            max_iter (int, optional): The maximum number of iterations. Defaults to 30.
            amplitude (float, optional): The amplitude for generating parameter values. Defaults to 0.5.
            cutoff (int, optional): The cutoff value for generating parameter values. Defaults to 5.
            params (dict, optional): The hyperparameters to be tuned. If None, the best parameters from random search will be used. Defaults to None.

        Returns:
            dict: The results of the grid search.
        """
        if params is None:
            if not hasattr(self, "random_search"):
                raise ValueError("Params must be passed or RandomSearch must be run")

            # Copy so the best_params_ stored by the random search is left untouched.
            params = dict(self.random_search.best_params_)
            if max_iter < 3 * len(params):
                print("max_iter is less than 3 times the number of parameters. It is being set to 3 times the number of parameters")
                max_iter = 3 * len(params)

            cutoff = self.ideal_cutoff(len(params), cutoff, max_iter)

            for param, value in params.items():
                if isinstance(value, (int, np.integer)) and not isinstance(value, bool):
                    # Integer window centred on the best value; the +1 keeps the upper bound
                    # and guarantees a non-empty grid.
                    delta = min(cutoff, int(value * amplitude))
                    params[param] = np.arange(value - delta, value + delta + 1)
                elif isinstance(value, (float, np.floating)):
                    # Three evenly spaced values around the best value (np.arange with its
                    # default step of 1 would usually yield a single point here).
                    delta = min(cutoff, value * amplitude)
                    params[param] = np.unique(np.linspace(value - delta, value + delta, 3))
                else:
                    # Non-numeric values are kept fixed; GridSearchCV expects a list per key.
                    params[param] = [value]

        fine_search = GridSearchCV(
            self.model.model,
            params,
            scoring=self.scoring_fn,
            cv=self.cv
        )

        fine_search.fit(self.model.get_X(type="train", dataset=self.dataset), self.dataset.Y_train)
        self.fine_search = fine_search
        self.grid_search_best_model = copy.deepcopy(self.model)
        # Use the estimator found by the grid search, not the one from the random search.
        self.grid_search_best_model._model = clone(self.fine_search.best_estimator_)
        return fine_search.cv_results_

    @property
    def best_model(self):
        """
        Returns the best model found during the search.

        Raises:
            ValueError: If no search was run.

        Returns:
            Model: The best model.
        """
        if hasattr(self, "grid_search_best_model"):
            return self.grid_search_best_model
        elif hasattr(self, "random_search_best_model"):
            return self.random_search_best_model
        else:
            raise ValueError("No search was run")
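
The coarse-to-fine idea behind `RandomSearch` followed by `GridSearch` can be sketched directly with scikit-learn: a random search over a wide range, then an exhaustive search in a small window around each best value. The estimator, the ranges, and the `window` helper are assumptions made for this example and do not reproduce the exact window rule of the class above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Stage 1: coarse random search over a wide range (illustrative values).
coarse = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=0),
    {"max_depth": list(range(2, 12)), "min_samples_leaf": list(range(1, 10))},
    n_iter=5, cv=3, scoring="accuracy", random_state=123,
)
coarse.fit(X, y)


def window(value, amplitude=0.5, cutoff=2, minimum=1):
    """Integer neighbourhood of `value`, at most `cutoff` steps on each side."""
    delta = min(cutoff, int(value * amplitude))
    return list(range(max(minimum, value - delta), value + delta + 1))


# Stage 2: exhaustive search in a small window around each coarse optimum.
fine_grid = {name: window(value) for name, value in coarse.best_params_.items()}
fine = GridSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=0),
    fine_grid, cv=3, scoring="accuracy",
)
fine.fit(X, y)
print(coarse.best_params_, fine.best_params_)
```

Because each window contains the coarse optimum itself, the fine search always re-evaluates the point found by the random search along with its neighbours.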



### Orchestrator

In [None]:

class orchestrator():
    """
    The orchestrator class is responsible for coordinating the different steps of a machine learning pipeline.
    It takes in various components such as models, data loader, preprocessor, trainer, evaluator, model tunning, and model selector.
    The main purpose of this class is to provide a high-level interface to run the entire pipeline and obtain the best model.

    Args:
        models (list): A list of machine learning models to be trained and evaluated.
        loader (object): An object that loads the dataset.
        preprocessor_class (class): The class for preprocessing the dataset.
        ml_reprocessing_class (class): The class for preprocessing the dataset for machine learning.
        trainer_class (class): The class for training the models.
        evaluation_class (class): The class for evaluating the models.
        model_tunning_class (class): The class for tuning the hyperparameters of the models.
        model_selector_class (class): The class for selecting the best model based on a decision metric.

    Attributes:
        models (list): A list of machine learning models.
        loader (object): An object that loads the dataset.
        preprocessing_class (class): The class for preprocessing the dataset.
        ml_preprocessing_class (class): The class for preprocessing the dataset for machine learning.
        trainer_class (class): The class for training the models.
        evaluation_class (class): The class for evaluating the models.
        tunning_class (class): The class for tuning the hyperparameters of the models.
        model_selector_class (class): The class for selecting the best model based on a decision metric.
        preprocessor (object): An object that performs preprocessing on the dataset.
        MLpreprocessor (object): An object that performs preprocessing for machine learning on the dataset.
        trainer (object): An object that trains the models.
        evaluator (object): An object that evaluates the models.
        best_model (object): The best model selected based on a decision metric.
        tunning (object): An object that tunes the hyperparameters of the best model.
        bestmodel_MLpreprocessor (object): An object that performs preprocessing for machine learning on the dataset for the best model.
        bestmodel_trainer (object): An object that trains the best model.
        bestmodel_evaluator (object): An object that evaluates the best model.

    Methods:
        run_preprocessing: Performs preprocessing on the dataset.
        run_ml_preprocessing: Performs preprocessing for machine learning on the dataset.
        run_training: Trains the models.
        run_model_evaluation: Evaluates the models.
        run_model_selection: Selects the best model based on a decision metric.
        run_tuning: Tunes the hyperparameters of a model.
        run_pipeline: Runs the entire machine learning pipeline.
        run_retrain_best_model: Retrains the tuned best model without a validation split and evaluates it on the test sample.
        pickle_model: Saves the best model to a file.
        get_metrics: Returns the metrics of the best model.
        plot_metrics: Plots the evaluation metrics computed during model selection.
    """

    def __init__(self, models, loader, preprocessor_class: Preprocessor, ml_reprocessing_class: MLPreprocessing,
                 trainer_class: Trainer, evaluation_class: Evaluation, model_tunning_class: ModelTunning,
                 model_selector_class: Selector) -> None:
        """
        Initializes the orchestrator class with the provided components.

        Args:
            models (list): A list of machine learning models to be trained and evaluated.
            loader (object): An object that loads the dataset.
            preprocessor_class (class): The class for preprocessing the dataset.
            ml_reprocessing_class (class): The class for preprocessing the dataset for machine learning.
            trainer_class (class): The class for training the models.
            evaluation_class (class): The class for evaluating the models.
            model_tunning_class (class): The class for tuning the hyperparameters of the models.
            model_selector_class (class): The class for selecting the best model based on a decision metric.
        """
        self.models = models
        self.loader = loader
        self.preprocessing_class = preprocessor_class
        self.ml_preprocessing_class = ml_reprocessing_class
        self.trainer_class = trainer_class
        self.evaluation_class = evaluation_class
        self.tunning_class = model_tunning_class
        self.model_selector_class = model_selector_class

    def run_preprocessing(self, replacement_columns=["tamanho_motor", "milhas_carro"], replacements=["L", "mile"],
                          adv_day_column="dia_aviso", adv_month_column="mes_aviso", adv_year_column="ano_aviso",
                          dumies_columns=["cor", "tipo_cambio"]):
        """
        Performs preprocessing on the dataset.

        Args:
            replacement_columns (list): A list of column names to be replaced.
            replacements (list): A list of replacement values corresponding to the replacement_columns.
            adv_day_column (str): The column name for the day of the advertisement.
            adv_month_column (str): The column name for the month of the advertisement.
            adv_year_column (str): The column name for the year of the advertisement.
            dumies_columns (list): A list of column names to be converted to dummy variables.

        Returns:
            object: An object that performs preprocessing on the dataset.
        """
        preprocessor = self.preprocessing_class(self.loader.dataset)
        preprocessor.replacements(columns=replacement_columns, replacements=replacements)
        preprocessor.set_types()

        # Build the date only when all three configured columns are present
        if (adv_day_column in self.loader.dataset.df.columns and adv_month_column in self.loader.dataset.df.columns and
                adv_year_column in self.loader.dataset.df.columns):
            preprocessor.create_date(day_column=adv_day_column, month_column=adv_month_column,
                                     year_column=adv_year_column)

        if all([column in self.loader.dataset.df.columns for column in dumies_columns]):
            preprocessor.create_dummies(dumies_columns)

        preprocessor.cat_to_codes()
        preprocessor.process_date()
        preprocessor.process_labels()
        preprocessor.fill_na()
        preprocessor.drop_na()
        preprocessor.update_dataset()
        return preprocessor

    def run_ml_preprocessing(self, dataset, validation=True, samples=["train", "validation", "test"]):
        """
        Performs preprocessing for machine learning on the dataset.

        Args:
            dataset (object): The dataset object to be preprocessed.
            validation (bool): Whether to include a validation set in the preprocessing.
            samples (list): A list of sample names to be preprocessed.

        Returns:
            object: An object that performs preprocessing for machine learning on the dataset.
        """
        MLpreprocessor = self.ml_preprocessing_class(dataset)
        MLpreprocessor.undersample()
        MLpreprocessor.split(validation=validation)
        MLpreprocessor.set_scaler()
        MLpreprocessor.scale(samples)
        return MLpreprocessor

    def run_training(self, models, dataset):
        """
        Trains the models.

        Args:
            models (list): A list of machine learning models to be trained.
            dataset (object): The dataset object to be used for training.

        Returns:
            object: An object that trains the models.
        """
        trainer = self.trainer_class(models, dataset)
        trainer.train()
        return trainer

    def run_model_evaluation(self, models, dataset, sample):
        """
        Evaluates the models.

        Args:
            models (list): A list of machine learning models to be evaluated.
            dataset (object): The dataset object to be used for evaluation.
            sample (str): The name of the sample to be evaluated.

        Returns:
            object: An object that evaluates the models.
        """
        evaluator = self.evaluation_class(models, dataset)
        evaluator.run(sample=sample)
        return evaluator

    def run_model_selection(self, models, dataset, decision_metric="f1", sample="validation"):
        """
        Selects the best model based on a decision metric.

        Args:
            models (list): A list of machine learning models to be evaluated and selected.
            dataset (object): The dataset object to be used for evaluation.
            decision_metric (str): The decision metric to be used for model selection.
            sample (str): The name of the sample to be used for model selection.

        Returns:
            tuple: A tuple containing the trainer object, evaluator object, and the best model object.
        """
        trainer = self.run_training(models, dataset)
        evaluator = self.run_model_evaluation(trainer.models, dataset, sample=sample)
        best_model = self.model_selector_class(evaluator.models, decision_metric).select()
        return trainer, evaluator, best_model

    def run_tuning(self, model, dataset, cv=3, random_n_iter=10, grid_max_iter=10):
        """
        Tunes the hyperparameters of a model.

        Args:
            model (object): The model object to be tuned.
            dataset (object): The dataset object to be used for tuning.
            cv (int): The number of cross-validation folds.
            random_n_iter (int): The number of iterations for random search.
            grid_max_iter (int): The maximum number of iterations for grid search.

        Returns:
            object: An object that tunes the hyperparameters of the model.
        """
        tunning = self.tunning_class(model, dataset, cv=cv, random_state=123, scoring_fn="roc_auc")
        tunning.RandomSearch(n_iter=random_n_iter)
        tunning.GridSearch(max_iter=grid_max_iter, amplitude=0.5, cutoff=3)
        return tunning

    def run_pipeline(self, cv, random_n_iter, grid_max_iter):
        """
        Runs the entire machine learning pipeline.

        Args:
            cv (int): The number of cross-validation folds.
            random_n_iter (int): The number of iterations for random search.
            grid_max_iter (int): The maximum number of iterations for grid search.

        Returns:
            object: The best model object.
        """
        self.preprocessor = self.run_preprocessing()
        self.MLpreprocessor = self.run_ml_preprocessing(self.preprocessor.dataset, validation=True)
        self.trainer, self.evaluator, best_model = self.run_model_selection(self.models, self.preprocessor.dataset)
        self.tunning = self.run_tuning(best_model, self.MLpreprocessor.dataset, cv=cv, random_n_iter=random_n_iter,
                                       grid_max_iter=grid_max_iter)
        self.best_model = self.run_retrain_best_model()
        return self.best_model.model

    def run_retrain_best_model(self):
        """
        Retrains the tuned best model on a split without a validation set, evaluates it on the
        test sample, and pickles the fitted estimator.

        Returns:
            object: The best model object.
        """
        best_model = self.tunning.best_model
        self.bestmodel_MLpreprocessor = self.run_ml_preprocessing(self.preprocessor.dataset, validation=False,
                                                                  samples=["train", "test"])
        self.bestmodel_trainer = self.run_training(best_model, self.bestmodel_MLpreprocessor.dataset)
        self.bestmodel_evaluator = self.run_model_evaluation(best_model,
                                                             self.bestmodel_MLpreprocessor.dataset, sample="test")
        self.pickle_model(best_model.model)
        return best_model

    def pickle_model(self, model, filename="best_model.pkl"):
        """
        Saves the best model to a file.

        Args:
            model (object): The model object to be saved.
            filename (str): The filename to save the model.

        Returns:
            None
        """
        # Use a context manager so the file handle is closed after writing
        with open(filename, "wb") as file:
            pickle.dump(model, file)

    def get_metrics(self):
        """
        Returns the metrics of the best model.

        Returns:
            dict: A dictionary containing the metrics of the best model.
        """
        return self.best_model.show_metrics()

    def plot_metrics(self):
        """
        Plots the evaluation metrics computed during model selection.

        Returns:
            object: Whatever the evaluator's plot_metrics method returns.
        """
        return self.evaluator.plot_metrics()




### Pipeline Execution

In [None]:
columns_map={"Maker":"fabricante"," Genmodel":"modelo_carro"," Genmodel_ID":"ano_modelo_carro",
            "Door_num":"portas","Seat_num":"lugares","repair_complexity":"nivel_conserto",
            "repair_cost":"custo_conserto","repair_date":"data_conserto","repair_hours":"tempo_conserto",
            "breakdown_date":"data_sinistro","Fuel_type":"combustível","Color":"cor","Adv_year":"ano_aviso",
            "Adv_month":"mes_aviso","Bodytype":"tipo_carro","issue":"tipo_falha","issue_id":"categoria_falha",
            "Reg_year":"ano_registro","Engin_size":"tamanho_motor","Gearbox":"tipo_cambio","Adv_day":"dia_aviso",
            "Runned_Miles":"milhas_carro","Price":"preço"}

cat_columns=["fabricante","modelo_carro","ano_modelo_carro","cor","tipo_carro",
             "tipo_cambio","combustível","tipo_falha","categoria_falha","nivel_conserto"]

disc_columns=["ano_registro","lugares","portas","dia_aviso",'tamanho_motor','mes_aviso','ano_aviso']
date_columns=["data_conserto","data_sinistro",]
cont_columns=['preço','tempo_conserto',"milhas_carro","custo_conserto"]
label_columns=["Label"]


loader=Loader('vehicle_claims_labeled.csv',cat_columns=cat_columns,date_columns=date_columns,
              cont_columns=cont_columns,disc_columns=disc_columns,label_columns=label_columns)
loader.rename_columns(columns_map)



KNN=Model(KNeighborsClassifier,params={},supervised=True,run_scaled=False)
KNN.grid_search_params(n_neighbors=range(3,25,2),weights=["uniform","distance"])
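# Assumption: params is forwarded to the estimator constructor; liblinear supports both the l1 and l2 penalties searched below
LR=Model(LogisticRegression,params={"solver":"liblinear"},supervised=True,run_scaled=False)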
LR.random_grid_search_params(C=np.logspace(-4,4,20),penalty=["l1","l2"])
NB=Model(GaussianNB,params={},supervised=True,run_scaled=True)
NB.random_grid_search_params(var_smoothing=np.logspace(-9,-1,50))
models=[KNN,LR,NB]
ml_pipeline=orchestrator(models,loader,Preprocessor,MLPreprocessing,Trainer,Evaluation,ModelTunning,Selector)
ml_pipeline.run_pipeline(cv=3,random_n_iter=3,grid_max_iter=3)
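
By default `run_model_selection` compares the candidate models on the validation sample using the F1 score. The sketch below illustrates that selection step on hand-made metrics. The function `select_best` and the metric values are hypothetical and do not reproduce the notebook's `Selector` class.

```python
def select_best(metrics_by_model, decision_metric="f1"):
    """Return the name of the model with the highest value of the decision metric."""
    missing = [name for name, metrics in metrics_by_model.items() if decision_metric not in metrics]
    if missing:
        raise KeyError(f"{decision_metric!r} missing for: {missing}")
    return max(metrics_by_model, key=lambda name: metrics_by_model[name][decision_metric])


# Placeholder numbers for illustration only, not results from this notebook
example_metrics = {
    "KNN": {"f1": 0.60, "recall": 0.70},
    "LR": {"f1": 0.65, "recall": 0.55},
    "NB": {"f1": 0.50, "recall": 0.80},
}
print(select_best(example_metrics))            # selects by F1
print(select_best(example_metrics, "recall"))  # selects by recall
```

Choosing a different decision metric can change the winner, which matters for fraud detection, where missing a fraudulent claim and flagging a legitimate one carry different costs.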


In [None]:
ml_pipeline.get_metrics()

In [None]:
a=ml_pipeline.plot_metrics()
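
`run_retrain_best_model` writes the tuned estimator to `best_model.pkl`. A minimal sketch of how such a file could be loaded and used to rank new claims by fraud probability is shown below. The helpers `load_model`, `rank_by_fraud_probability`, and `StubModel` are hypothetical, and the sketch assumes the saved estimator exposes a scikit-learn style `predict_proba` whose second column is the fraud class.

```python
import pickle


def load_model(path="best_model.pkl"):
    """Load a pickled estimator; only unpickle files from a trusted source."""
    with open(path, "rb") as file:
        return pickle.load(file)


def rank_by_fraud_probability(model, records, top_n=10):
    """Return (record index, fraud probability) pairs, highest risk first."""
    # Assumption: column 1 of predict_proba holds the probability of the fraud class
    probabilities = [row[1] for row in model.predict_proba(records)]
    ranked = sorted(enumerate(probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


class StubModel:
    """Hypothetical stand-in for the unpickled estimator, used only for this demo."""

    def predict_proba(self, records):
        # Toy rule: the first feature, clipped to [0, 1], is treated as the fraud probability
        clipped = [min(max(record[0], 0.0), 1.0) for record in records]
        return [[1 - p, p] for p in clipped]


new_claims = [[0.2], [0.9], [0.5]]
print(rank_by_fraud_probability(StubModel(), new_claims, top_n=2))
```

In practice the model returned by `load_model()` would replace `StubModel()`, and new data would need the same preprocessing and scaling that the training data received before being passed to it.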