## Context:
As a Fraud Analyst at a car insurance company, I need an efficient solution that uses machine learning to detect fraud in large volumes of data, in order to identify fraudulent activity more accurately and minimize financial losses for the company.

Description: As a Fraud Analyst, I want a Python application that uses machine learning techniques to analyze car insurance data, so that I can detect possible fraud efficiently and effectively.

# Acceptance Criteria:
# Data Import and Preparation:
The application must allow insurance data to be imported from CSV files, Excel files, or a database.
It must clean the data, including handling null values and inconsistencies.
It must transform categorical data into numeric formats that the machine learning model can use.
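
The preparation criteria above can be illustrated with a minimal pandas sketch. The column names (`claim_id`, `color`, `repair_cost`, `label`), the in-memory CSV, and the imputation choices (median for numbers, an explicit `unknown` category for text) are assumptions made for this example only, not the schema or rules of the real claims data.

```python
import io

import pandas as pd

# Hypothetical sample standing in for a claims CSV file.
RAW_CSV = """claim_id,color,repair_cost,label
1,red,100.0,0
2,,250.0,1
3,blue,,0
4,red,80.0,0
"""


def load_and_prepare(source) -> pd.DataFrame:
    """Read claims from a CSV source, fill nulls and one-hot encode categories."""
    df = pd.read_csv(source)
    # Numeric nulls: impute with the column median (an assumed policy).
    df["repair_cost"] = df["repair_cost"].fillna(df["repair_cost"].median())
    # Categorical nulls: keep the row and mark the value as unknown.
    df["color"] = df["color"].fillna("unknown")
    # One-hot encode the categorical column into 0/1 integer columns.
    return pd.get_dummies(df, columns=["color"], dtype=int)


prepared = load_and_prepare(io.StringIO(RAW_CSV))
print(prepared)
```

`pd.read_excel` or `pd.read_sql` could replace `pd.read_csv` for the other sources named in the criteria; the cleaning steps would stay the same.
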
# Model Training:
The application must offer the option to train a machine learning model on a historical insurance dataset whose records were previously labeled as legitimate or fraudulent.
It must split the dataset into training and test sets and allow the machine learning algorithm to be chosen (e.g., Random Forest, Gradient Boosting, CatBoost).
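
One possible shape for the training step is sketched below with scikit-learn. The `ALGORITHMS` registry, the `train_model` helper and the synthetic data are hypothetical names introduced for this example; CatBoost is left out because it lives in a separate package, but it could be registered the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical registry mapping a user-facing name to an estimator class.
ALGORITHMS = {
    "random_forest": RandomForestClassifier,
    "gradient_boosting": GradientBoostingClassifier,
}


def train_model(X, y, algorithm="random_forest", test_size=0.2, random_state=42):
    """Split the data, fit the chosen algorithm and return it with the held-out set."""
    if algorithm not in ALGORITHMS:
        raise ValueError(f"Unknown algorithm: {algorithm}")
    # Stratify on the label so the fraud ratio is similar in both sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )
    model = ALGORITHMS[algorithm](random_state=random_state)
    model.fit(X_train, y_train)
    return model, (X_test, y_test)


# Synthetic, imbalanced data standing in for labeled historical claims.
X, y = make_classification(n_samples=200, n_features=6, weights=[0.8, 0.2], random_state=0)
model, (X_test, y_test) = train_model(X, y, algorithm="gradient_boosting")
print(type(model).__name__, len(X_test))
```
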
# Fraud Detection:
The application must be able to apply the trained model to predict the probability of fraud in new insurance data.
It must generate a list of records with their respective probabilities of being fraudulent, highlighting the highest-risk cases.
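
The scoring step could look roughly like the sketch below, which uses `predict_proba` to attach a fraud probability to each new record and flags the riskiest ones. The `rank_by_fraud_risk` helper, the feature names and the synthetic training data are assumptions for illustration, with class `1` taken to mean fraud.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def rank_by_fraud_risk(model, records: pd.DataFrame, top_n: int = 3) -> pd.DataFrame:
    """Return the records sorted by fraud probability, flagging the top_n riskiest."""
    # Column of predict_proba that corresponds to the positive (fraud) class.
    positive = list(model.classes_).index(1)
    scored = records.copy()
    scored["fraud_probability"] = model.predict_proba(records)[:, positive]
    ranks = scored["fraud_probability"].rank(ascending=False, method="first")
    scored["high_risk"] = ranks <= top_n
    return scored.sort_values("fraud_probability", ascending=False)


# Synthetic history: claims with a high repair cost are labeled as fraud.
rng = np.random.default_rng(0)
history = pd.DataFrame({
    "repair_cost": rng.normal(100, 20, 200),
    "repair_hours": rng.normal(5, 1, 200),
})
labels = (history["repair_cost"] > 110).astype(int)
model = LogisticRegression(max_iter=1000).fit(history, labels)

new_claims = pd.DataFrame({
    "repair_cost": [60.0, 150.0, 105.0, 130.0],
    "repair_hours": [5.0, 5.0, 5.0, 5.0],
})
ranked = rank_by_fraud_risk(model, new_claims, top_n=2)
print(ranked)
```
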
# Reports and Visualizations:
The application must provide detailed reports with model performance metrics (e.g., precision, recall, F1-score, AUC-ROC).
It must offer interactive visualizations, such as charts of the distribution of detected fraud, feature importance in the model, and comparisons between fraudulent and legitimate records.
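
As a rough illustration of the metrics report, the sketch below computes the listed metrics from true labels and predicted fraud probabilities. The `performance_report` helper, the 0.5 decision threshold and the toy values are assumptions for this example; the visualizations are not covered here.

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score


def performance_report(y_true, y_score, threshold=0.5):
    """Compute threshold-based metrics plus AUC-ROC from fraud probabilities."""
    # Turn probabilities into hard predictions with the chosen threshold.
    y_pred = [int(score >= threshold) for score in y_score]
    return {
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        # AUC-ROC is computed on the probabilities, not on the hard predictions.
        "roc_auc": roc_auc_score(y_true, y_score),
    }


# Toy labels and scores, for illustration only.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.6, 0.8, 0.4, 0.2, 0.9]
report = performance_report(y_true, y_score)
print(report)
```
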


In [None]:

import pickle
import time
from typing import Any, List, Union

import matplotlib.pyplot as plt  # used by Evaluation.plot_metrics
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.base import BaseEstimator, clone  # clone is used by ModelTunning
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score,roc_auc_score
from sklearn.model_selection import GridSearchCV,GroupKFold,ParameterSampler,RandomizedSearchCV,StratifiedKFold,train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler,StandardScaler,RobustScaler,Normalizer

In [None]:
class Dataset:


    ### pass preprocessor
    ###   
    def __init__(self,df,cat_columns,date_columns,cont_columns,disc_column,label_column=None):
        self._df=df
        self.cat_columns= cat_columns
        self.date_columns= date_columns
        self.cont_columns= cont_columns
        self.disc_columns= disc_column
        
        if isinstance(label_column,str):
            label_column=[label_column]
            
        if label_column is not None:
            self.labeled=True
            self.label_column=label_column
        else:
            self.labeled=False
            self.label_column=[]
            
        
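    @staticmethod
    def check_columns(all_columns,*columns_groups):
        # Flatten the column groups and compare them with the DataFrame columns.
        columns=[]
        for group in columns_groups:
            if group:
                columns+=[group] if isinstance(group,str) else list(group)
        if set(all_columns)!=set(columns):
            raise ValueError(f"Columns do not match: {set(all_columns).symmetric_difference(columns)}")

    # check_columns is not called from __init__: Loader builds the Dataset before the
    # columns are renamed, so validate explicitly once the final names are in place, e.g.
    # Dataset.check_columns(dataset.df.columns,dataset.cat_columns,dataset.date_columns,
    #                       dataset.cont_columns,dataset.disc_columns,dataset.label_column)

    @property
    def not_label_columns(self):
        # Feature columns used for modelling; date columns are excluded.
        return self.cat_columns+self.cont_columns+self.disc_columns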


    @property
    def df(self):
        return self._df
    
    @df.setter
    def df(self,df):
        self._df=df

    def set_sample(self,sample,type,name,subtype=None):
        if type  not in ["train","validation","test"]:
            raise ValueError("Type must be train, validation or test")
        if name not in ["X","Y"]:
            raise ValueError("Name must be X or Y")
        if subtype not in ["scaled",None]:
            raise ValueError("Subtype must be scaled or none")
        
        att_name=name+"_"+type if subtype is None else name+"_"+type+"_"+subtype

        setattr(self,att_name,sample)
    
    def get_sample(self,type,name,subtype=None):
        if type  not in ["train","validation","test"]:
            raise ValueError("Type must be train, validation or test")
        if name not in ["X","Y"]:
            raise ValueError("Name must be X or Y")
        if subtype not in ["scaled",None]:
            raise ValueError("Subtype must be scaled or none")

                
        att_name=name+"_"+type if subtype is None else name+"_"+type+"_"+subtype

        return getattr(self,att_name)



In [None]:

class Loader:

    def __init__(self,path,cat_columns,date_columns,cont_columns,disc_columns,label_columns=None):
        df=pd.read_csv(path,dtype=str)
        self._dataset=Dataset(df,cat_columns,date_columns,cont_columns,disc_columns,label_columns)

    def rename_columns(self,columns_map):
        self._dataset.df=self._dataset.df.rename(columns=columns_map)
        
    @property
    def dataset(self):
        return self._dataset
        


In [None]:


class Preprocessor:
    def __init__(self,dataset:Dataset):
        self.dataset=dataset
        self.df=dataset.df
        self.cat_columns= self.dataset.cat_columns
        self.date_columns= self.dataset.date_columns
        self.cont_columns= self.dataset.cont_columns
        self.disc_columns= self.dataset.disc_columns
        
    def _convert_cat(self):
        self.df.loc[:,self.cat_columns]=self.df[self.cat_columns].astype('category')
    
    def _convert_date(self):
        for col in self.date_columns:
            self.df[col]=pd.to_datetime(self.df[col],errors='coerce')
    
    def _convert_cont(self):
        self.df[self.cont_columns]=self.df[self.cont_columns].astype(float)
  

    def _covert_discrete(self):
        self.df[self.disc_columns]=self.df[self.disc_columns].astype(float).astype(int)
       

    def fill_na(self,fill_value=""):
        self.df=self.df.fillna(fill_value)
    
    def na_analysis(self):
        return self.df.isna().sum()
    
    def drop_na(self):
        self.df=self.df.dropna()
    
    
    def replacements(self,columns,replacements):
        # Normalize columns first so a single replacement string is repeated once per column.
        if isinstance(columns,str):
            columns=[columns]
        if isinstance(replacements,str):
            replacements=[replacements]*len(columns)
        if len(columns)!=len(replacements):
            raise ValueError("Columns and replacements must have the same length")

        # Strip each literal substring (e.g. a unit suffix) from its column.
        for column,replacement in zip(columns,replacements):
            self.df[column]=self.df[column].str.replace(replacement,"",regex=False)

    def create_dummies(self,columns):
        # Names of the one-hot columns that pd.get_dummies creates (missing values get no column).
        new_columns=[]
        for column in columns:
            new_columns+=[column+"_"+str(value) for value in self.df[column].dropna().unique()]
        self.df=pd.get_dummies(self.df,columns=columns)
        for column in columns:
            if column in self.cat_columns:
                self.cat_columns.remove(column)
        self.cat_columns+=new_columns
    
    def create_date(self,day_column,month_column,year_column,date_column=None):
        if date_column is None:
            date_column="data_"+day_column.split("_")[-1]
        # Build "day-month-year" strings and parse them with an explicit format so that
        # day and month are never swapped; invalid dates become NaT.
        self.df[date_column]=pd.to_datetime(self.df[day_column].astype(float).astype(int).astype(str)\
                                                +"-"+self.df[month_column].astype(float).astype(int).astype(str)\
                                                +"-"+self.df[year_column].astype(float).astype(int).astype(str)\
                                        ,format="%d-%m-%Y",errors='coerce')
        self.date_columns.append(date_column)
    

    def process_date(self):        
        for name,func in zip(["weekday","day","month","year"],[lambda x: x.dt.weekday,lambda x: x.dt.day,lambda x: x.dt.month,lambda x: x.dt.year]):
            columns=list(map(lambda x: x+"_"+name,self.date_columns))
            self.df[columns]=self.df[self.date_columns].apply(func)
            self.disc_columns+=columns
    
    def process_labels(self):
        self.df[self.dataset.label_column]=self.df[self.dataset.label_column].apply(lambda x: x.astype('category').cat.codes)

    def cat_to_codes(self,columns=None):
        if columns is None:
            columns=self.cat_columns
        # Replace each category with its integer code (missing values become -1).
        self.df[columns]=self.df[columns].apply(lambda x: x.astype('category').cat.codes)
        
    def set_types(self):       
        # self._convert_cat()
        self._convert_date()
        self._convert_cont()
        self._covert_discrete()     

    def update_dataset(self):
        self.dataset.df=self.df
        self.dataset.cat_columns=self.cat_columns
        self.dataset.date_columns=self.date_columns
        self.dataset.cont_columns=self.cont_columns
        self.dataset.disc_columns=self.disc_columns


In [None]:


class MLPreprocessing():


    @staticmethod
    def _split(X,Y,**kwargs):
        # Thin wrapper around sklearn's train_test_split.
        X_train, X_test, Y_train, Y_test=train_test_split(X,Y,**kwargs)

        return X_train, X_test, Y_train, Y_test
   

    def __init__(self,dataset:Dataset) -> None:
        self.dataset=dataset
        self.dataset.__setattr__("scalable_columns",self.dataset.cont_columns+self.dataset.disc_columns)
        self.dataset.__setattr__("not_scalable_columns",self.dataset.cat_columns)

    def undersample(self,random_state=42):
        # Balance the classes: keep every positive row and an equally sized random sample
        # of negative rows. The frame seen before sampling is kept in dataset.full_df.
        label=self.dataset.label_column[0]
        positive_sample_size=(self.dataset.df[label]==1).sum()
        self.dataset.__setattr__("full_df",self.dataset.df)
        self.dataset.df=pd.concat([self.dataset.df[self.dataset.df[label]==0].sample(positive_sample_size,random_state=random_state),
                                   self.dataset.df[self.dataset.df[label]==1]],axis=0)
        print(positive_sample_size)


    def set_scaler(self,type="minmax"):
        scalers={"minmax":MinMaxScaler,"standard":StandardScaler,"robust":RobustScaler,"normalizer":Normalizer}
        self.scaler=scalers[type]()

    def scale(self,sample=("train","validation","test")):

        if not hasattr(self,"scaler"):
            raise ValueError("Scaler was not set")
        if not hasattr(self.dataset,"X_train"):
            raise ValueError("Train data was not set")

        # Fit the scaler on the training sample only; the other samples reuse the fitted scaler.
        X_train_scaled=self.scaler.fit_transform(self.dataset.X_train[self.dataset.scalable_columns].astype(float))
        X_train_scaled=pd.concat([self.dataset.X_train[self.dataset.not_scalable_columns].reset_index(drop=True),
                                  pd.DataFrame(X_train_scaled,columns=self.dataset.scalable_columns)],axis=1)
        self.dataset.set_sample(X_train_scaled,"train","X",subtype="scaled")

        # A sample may be set to None (e.g. when no validation split was made); it is then skipped.
        if "test" in sample and getattr(self.dataset,"X_test",None) is not None:
            X_test_scaled=self.scaler.transform(self.dataset.X_test[self.dataset.scalable_columns].astype(float))
            X_test_scaled=pd.concat([self.dataset.X_test[self.dataset.not_scalable_columns].reset_index(drop=True),
                                     pd.DataFrame(X_test_scaled,columns=self.dataset.scalable_columns)],axis=1)
            self.dataset.set_sample(X_test_scaled,"test","X",subtype="scaled")

        if "validation" in sample and getattr(self.dataset,"X_validation",None) is not None:
            X_val_scaled=self.scaler.transform(self.dataset.X_validation[self.dataset.scalable_columns].astype(float))
            X_val_scaled=pd.concat([self.dataset.X_validation[self.dataset.not_scalable_columns].reset_index(drop=True),
                                    pd.DataFrame(X_val_scaled,columns=self.dataset.scalable_columns)],axis=1)
            self.dataset.set_sample(X_val_scaled,"validation","X",subtype="scaled")
    

    def split(self,validation=True,stratified=True,columns_stratify=None):

        # 70/15/15 train/validation/test with a validation sample, 80/20 train/test without.
        validation_size=0.15 if validation else 0.0
        test_size=0.15 if validation else 0.20
        if not self.dataset.labeled:
            raise NotImplementedError("Splitting requires a labeled dataset")

        # Stratify on the label, plus any extra columns, so class ratios are kept in every sample.
        stratify_columns=None
        if stratified:
            stratify_columns=list(columns_stratify or [])+self.dataset.label_column

        df=self.dataset.df
        X_train, X_test, Y_train, Y_test=MLPreprocessing._split(df[self.dataset.not_label_columns],df[self.dataset.label_column],random_state=42,
                                                                test_size=test_size+validation_size,
                                                                stratify=df[stratify_columns] if stratified else None)
        if validation:
            # Split the held-out rows evenly into test and validation samples.
            X_test, X_val, Y_test, Y_val=MLPreprocessing._split(X_test,Y_test,random_state=42,test_size=validation_size/(test_size+validation_size),
                                                                stratify=df.loc[X_test.index,stratify_columns] if stratified else None)
        else:
            X_val,Y_val=None,None

        self.dataset.set_sample(X_train,"train","X")
        self.dataset.set_sample(Y_train,"train","Y")
        self.dataset.set_sample(X_val,"validation","X")
        self.dataset.set_sample(Y_val,"validation","Y")
        self.dataset.set_sample(X_test,"test","X")
        self.dataset.set_sample(Y_test,"test","Y")



    def cross_validation(self):
        pass


In [None]:
class Model(BaseEstimator):

    @staticmethod
    def camel_case_split(str):     
        start_idx = [i for i, e in enumerate(str)
                    if e.isupper()] + [len(str)]
    
        start_idx = [0] + start_idx
        return [str[x: y] for x, y in zip(start_idx, start_idx[1:])]         
    
    
    def __init__(self,model,params,supervised:bool,run_scaled:bool=False,run_on_categorical=True,run_on_continues=True,created=False) -> None:
        self._model=model
        self.params=params
        self.supervised=supervised
        if not created:
            self.set_params()
        else:
            self.created=created
        self.run_scaled=run_scaled
        self.run_on_categorical=run_on_categorical
        self.run_on_continues=run_on_continues
        if not (run_on_categorical or run_on_continues):
            raise ValueError("Model must run on categorical and/or continues columns")
        
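    def get_X(self,type:str,dataset:Dataset):
        # type is one of "train", "validation" or "test".
        subtype="scaled" if self.run_scaled else None
        columns=dataset.not_label_columns
        # Filter with list comprehensions so the feature order stays stable.
        if not self.run_on_categorical:
            columns=[column for column in columns if column not in dataset.cat_columns]
        if not self.run_on_continues:
            columns=[column for column in columns if column not in dataset.cont_columns]

        return dataset.get_sample(type,"X",subtype=subtype)[columns]

    def set_params(self):
        # Instantiate the estimator class with the stored parameters.
        self.created=True
        self._model=self._model(**self.params)

    def copy(self):
        # Independent copy of the wrapper and its estimator (used by Trainer and ModelTunning).
        from copy import deepcopy
        return deepcopy(self)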
    
    def grid_search_params(self,**params):
        self.grid_search_params=params

    def random_grid_search_params(self,**params):
        self.grid_search_params=params

    @property
    def model(self):
        if not hasattr(self,"created"):
            raise ValueError("Model was not set")
        
        return self._model        

    def set_metrics(self,metric,value):
        if not hasattr(self,"metrics"):
            self.metrics={}
        self.metrics[metric]=value

    def show_metrics(self):
        return self.metrics
    
    def fit(self,Y,X=None,type=None,dataset:Dataset=None,**kwargs):
        if X is None and type is not None and dataset is not None:
            X=self.get_X(type,dataset)
            
        return self._model.fit(X,Y,**kwargs)

    def __str__(self) -> str:
        name=(self.model.__str__()).split("(")[0]
        if self.run_scaled:
            name+="_scaled"
        if self.run_on_categorical:
            name+="_cat"
        if self.run_on_continues:
            name+="_cont"
        return name
            
    def __call__(self,X=None,type=None,dataset:Dataset=None):
        if type is not None and dataset is not None:
            X=self.get_X(type,dataset)
        if X is None:
            raise ValueError("X or type and Dataset must be passed")
        
        return self._model.predict(X)



In [None]:
class Trainer():
    def __init__(self,models:List[BaseEstimator],dataset:Dataset) -> None:
        if not isinstance(models,list):
            models=[models]
            
        self.models=models
        self.dataset=dataset




    def _train(self,model:Model,**kwargs):
        # Fit the model on the training sample and record the elapsed time.
        t1=time.time()
        if model.supervised:
            model.fit(type="train",dataset=self.dataset,Y=self.dataset.Y_train,**kwargs)
        else:
            # Unsupervised estimators are fitted without labels.
            model.fit(type="train",dataset=self.dataset,Y=None,**kwargs)

        t2=time.time()
        model.__setattr__("training_time",t2-t1)
        

    def train(self,**kwargs):
        if "model" not in kwargs:            
            for model in self.models:                    
                self._train(model,**kwargs)
        else:
            model=kwargs.pop("model")
            self._train(model,**kwargs)
        


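    def run_grid_search(self,random_state=123,n_iter=10):
        np.random.seed(random_state)

        candidates=[]
        for model in self.models:
            # Model.grid_search_params is a method until a search space is stored on the
            # instance, so only models that hold a dict of parameters are searched.
            if not isinstance(getattr(model,"grid_search_params",None),dict):
                continue
            parameters=ParameterSampler(model.grid_search_params,n_iter=n_iter,random_state=random_state)
            for param in parameters:
                # Train a copy of the model with the sampled parameters.
                candidate=model.copy()
                candidate.model.set_params(**param)
                self._train(candidate)
                candidates.append(candidate)
        return candidates



    def run_evaluation(self):
        # Metrics are computed by the Evaluation class and stored on each model.
        Evaluation(self.models,self.dataset).run()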

    def add(self,model:Model,train=True):
        self.models.append(model)
        if train:
            self.train(model=model)


In [None]:

class Evaluation():
    
    def __init__(self,models:List[Model],dataset:Dataset) -> None:
        if not isinstance(models,list):
            models=[models]
            
        self.dataset=dataset
        self.models=models


    def run(self,sample="validation"):
        # Score every model on the requested sample ("train", "validation" or "test").
        y_true=self.dataset.get_sample(sample,"Y")
        for model in self.models:
            # Predict once and reuse the predictions for every metric.
            y_pred=model(type=sample,dataset=self.dataset)

            accuracy=accuracy_score(y_true,y_pred)
            precision=precision_score(y_true,y_pred)
            recall=recall_score(y_true,y_pred)
            f1=f1_score(y_true,y_pred)
            roc_auc=roc_auc_score(y_true,y_pred)

            model.set_metrics("accuracy",accuracy)
            model.set_metrics("precision",precision)
            model.set_metrics("recall",recall)
            model.set_metrics("f1",f1)
            model.set_metrics("roc_auc",roc_auc)


    def metrics(self):
        metrics={}
        for model in self.models:
            metrics[str(model)]=model.show_metrics()
        return metrics
    
    def plot_metrics(self,orient="model"):
        if orient=="model":
            fig,axs=plt.subplot_mosaic([["accuracy","precision","recall"],["f1","roc_auc","empty"]],sharey=True,figsize=(10,4))
            axs["empty"].axis("off")  # the sixth panel is unused
            for metric in ["accuracy","precision","recall","f1","roc_auc"]:
                values=[]
                for model in self.models:
                    values.append(model.__getattribute__("metrics")[metric])
                axs[metric].set_title(metric)
                sns.barplot(x=values,y=[str(model) for model in self.models],ax=axs[metric])
                axs[metric].set_xlim(0.5,1)
            plt.tight_layout()
            return fig
        else:
            fig,axs=plt.subplots(len(self.models)//5+1,5,sharey=True)
            if len(self.models)//5+1 ==1:
                axs=axs.reshape(1,-1)

            for index,model in enumerate(self.models):
                values=[]
                for metric in ["accuracy","precision","recall","f1","roc_auc"]:
                    values.append(model.__getattribute__("metrics")[metric])
                axs[index//5,index%5].set_title(model.model,fontsize=8)
                sns.barplot(x=values,y=["accuracy","precision","recall","f1","roc_auc"],ax=axs[index//5,index%5])
            plt.tight_layout()
            return fig



In [None]:
class Selector():
    
    def __init__(self,models:List[Model],aimed_metric="accuracy") -> None:
        if not isinstance(models,list):
            models=[models]
        
        self.models=models
        self.aimed_metric=aimed_metric

    def check_metric(self):
        for model in self.models:
            if not hasattr(model,"metrics"):
                raise ValueError(f"{str(model)} has no metrics")
            elif model.metrics.get(self.aimed_metric) is None:
                raise ValueError(f"{str(model)} has no {self.aimed_metric} metric")


    def select(self):
        # Return the model with the highest value of the aimed metric.
        self.check_metric()
        scores=[]
        for model in self.models:
            scores.append(model.metrics.get(self.aimed_metric))

        argmax=scores.index(max(scores))

        return self.models[argmax]




In [None]:
class ModelTunning:
    
    @staticmethod
    def ideal_cutoff(size,cutoff,max_iter):
        # Lower the cutoff until size*cutoff fits within max_iter.
        if size*cutoff>max_iter:
            cutoff-=1
            return ModelTunning.ideal_cutoff(size,cutoff,max_iter)
        else:
            return cutoff

    def __init__(self,model:Model,dataset:Dataset,cv=10,random_state=123,scoring_fn="accuracy") -> None:
        self.model=model
        self.dataset=dataset
        self.random_state=random_state
        self.cv=cv
        self.scoring_fn=scoring_fn

    def RandomSearch(self,n_iter=10):
        search=RandomizedSearchCV(self.model.model,self.model.grid_search_params,n_iter=n_iter,random_state=self.random_state,
                                  scoring=self.scoring_fn,cv=self.cv)
        
        search.fit(self.model.get_X(type="train",dataset=self.dataset),self.dataset.Y_train)
        self.random_search=search
        self.random_search_best_model=self.model.copy()
        self.random_search_best_model._model=clone(self.random_search.best_estimator_)
        return search.cv_results_
    

    
    def GridSearch(self,max_iter=30,amplitude=0.5,cutoff=5,params=None):
        if params is None:
            if not hasattr(self,"random_search"):
                raise ValueError("Params must be passed or RandomSearch must be run")

            # Work on a copy so the random search results are left untouched.
            params=dict(self.random_search.best_params_)
            if max_iter<3*len(params):
                print("Max_iter is less than 3 times the number of parameters. It is being set to 3 times the number of parameters")
                max_iter=3*len(params)


            cutoff=self.ideal_cutoff(len(params),cutoff,max_iter)

            # Build a small grid around each best value found by the random search.
            for param,value in params.items():
                if isinstance(value,bool):
                    params[param]=[value]
                elif isinstance(value,(int,np.integer)):
                    delta=max(1,min(cutoff,int(value*amplitude)))
                    params[param]=np.arange(value-delta,value+delta)
                elif isinstance(value,(float,np.floating)):
                    delta=min(cutoff,value*amplitude)
                    params[param]=np.linspace(value-delta,value+delta,num=3)
                else:
                    # GridSearchCV expects a sequence of candidates for every parameter.
                    params[param]=[value]


        fine_search=GridSearchCV(self.model.model,params,scoring=self.scoring_fn,cv=self.cv)
        fine_search.fit(self.model.get_X(type="train",dataset=self.dataset),self.dataset.Y_train)
        self.fine_search=fine_search
        self.grid_search_best_model=self.model.copy()
        self.grid_search_best_model._model=clone(self.fine_search.best_estimator_)
        return fine_search.cv_results_
    
    @property
    def best_model(self):
        if hasattr(self,"grid_search_best_model"):
            return self.grid_search_best_model
        elif hasattr(self,"random_search_best_model"):
            return self.random_search_best_model
        else:
            raise ValueError("No search was run")



In [None]:

class orchestrator():

    def __init__(self,models,loader,
                 preprocessor_class:Preprocessor,
                 ml_preprocessing_class:MLPreprocessing,
                 trainer_class:Trainer,
                 evaluation_class:Evaluation,
                 model_tunning_class:ModelTunning,
                 model_selector_class:Selector) -> None:
        
        # The *_class arguments are the classes themselves; they are instantiated in the run_* methods.
        self.models=models
        self.loader=loader
        self.preprocessing_class=preprocessor_class
        self.ml_preprocessing_class=ml_preprocessing_class
        self.trainer_class=trainer_class
        self.evaluation_class=evaluation_class
        self.tunning_class=model_tunning_class
        self.model_selector_class=model_selector_class

    def run_preprocessing(self,replacement_columns=("tamanho_motor","milhas_carro"),replacements=("L","mile"),
                          adv_day_column="dia_aviso",adv_month_column="mes_aviso",adv_year_column="ano_aviso",dummies_columns=("cor","tipo_cambio")):
        
        preprocessor=self.preprocessing_class(self.loader.dataset)
        preprocessor.replacements(columns=list(replacement_columns),replacements=list(replacements))
        preprocessor.set_types()

        # Build a single date column when the day, month and year columns are all present.
        if all(column in self.loader.dataset.df.columns for column in (adv_day_column,adv_month_column,adv_year_column)):
            preprocessor.create_date(day_column=adv_day_column,month_column=adv_month_column,year_column=adv_year_column)
        
        # One-hot encode the selected categorical columns.
        if all(column in self.loader.dataset.df.columns for column in dummies_columns):
            preprocessor.create_dummies(list(dummies_columns))

        preprocessor.cat_to_codes()
        preprocessor.process_date()
        preprocessor.process_labels()
        preprocessor.fill_na()
        preprocessor.drop_na()
        preprocessor.update_dataset()
        return preprocessor

    def run_ml_preprocessing(self,dataset,validation=True):
        MLpreprocessor=self.ml_preprocessing_class(dataset)
        MLpreprocessor.undersample()
        MLpreprocessor.split(validation=validation)
        MLpreprocessor.set_scaler()
        MLpreprocessor.scale()
        return MLpreprocessor

    def run_trainnig(self,models,dataset):
        trainer=self.trainer_class(models,dataset)
        trainer.train()
        return trainer
    
    def run_model_evaluation(self,models,dataset):
        evaluator=self.evaluation_class(models,dataset)
        evaluator.run()
        return evaluator
    
    def run_model_selection(self,models,dataset,decision_metric="f1"):
        trainer=self.run_trainnig(models,dataset)
        evaluator=self.run_model_evaluation(trainer.models,dataset)
        best_model=self.model_selector_class(evaluator.models,decision_metric).select()
        return trainer,evaluator,best_model


    def run_tunning(self,model,dataset,cv=3,random_n_iter=10,grid_max_iter=10)->ModelTunning:
        tunning=self.tunning_class(model,dataset,cv=cv,random_state=123,scoring_fn="roc_auc")
        tunning.RandomSearch(n_iter=random_n_iter)
        tunning.GridSearch(max_iter=grid_max_iter,amplitude=0.5,cutoff=3)
        return tunning

    def run_pipeline(self,cv,random_n_iter,grid_max_iter):
        self.preprocessor=self.run_preprocessing()
        self.MLpreprocessor=self.run_ml_preprocessing(self.preprocessor.dataset,validation=True)
        self.trainer,self.evaluator,best_model=self.run_model_selection(self.models,self.preprocessor.dataset)
        self.tunning=self.run_tunning(best_model,self.MLpreprocessor.dataset,cv=cv,random_n_iter=random_n_iter,grid_max_iter=grid_max_iter)
        self.best_model=self.retrain_best_model()
        return self.best_model.model

    def concat_train_validation_data(self,X:List[pd.DataFrame],Y):
        # Stack the samples row-wise (train rows followed by validation rows).
        X=pd.concat(X,axis=0,ignore_index=True)
        Y=pd.concat(Y,axis=0,ignore_index=True)
        return X,Y
    
    def retrain_best_model(self):
        # Retrain the tuned model on a train/test split (no validation sample) and score it on the test sample.
        best_model=self.tunning.best_model
        self.bestmodel_MLpreprocessor=self.run_ml_preprocessing(self.preprocessor.dataset,validation=False)
        self.bestmodel_trainer=self.run_trainnig(best_model,self.bestmodel_MLpreprocessor.dataset)
        self.bestmodel_evaluator=self.evaluation_class([best_model],self.bestmodel_MLpreprocessor.dataset)
        self.bestmodel_evaluator.run(sample="test")
        self.pickle_model(best_model.model)
        return best_model

            
    def pickle_model(self,model,filename="best_model.pkl"):
        # Use a context manager so the file is closed after writing.
        with open(filename,"wb") as file:
            pickle.dump(model,file)

    def get_metrics(self):
        return self.best_model.show_metrics()

    def plot_metrics(self):
        return self.evaluator.plot_metrics()
    


In [None]:
columns_map={"Maker":"fabricante"," Genmodel":"modelo_carro"," Genmodel_ID":"ano_modelo_carro",
            "Door_num":"portas","Seat_num":"lugares","repair_complexity":"nivel_conserto",
            "repair_cost":"custo_conserto","repair_date":"data_conserto","repair_hours":"tempo_conserto",
            "breakdown_date":"data_sinistro","Fuel_type":"combustível","Color":"cor","Adv_year":"ano_aviso",
            "Adv_month":"mes_aviso","Bodytype":"tipo_carro","issue":"tipo_falha","issue_id":"categoria_falha",
            "Reg_year":"ano_registro","Engin_size":"tamanho_motor","Gearbox":"tipo_cambio","Adv_day":"dia_aviso",
            "Runned_Miles":"milhas_carro","Price":"preço"}

cat_columns=["fabricante", 'modelo_carro', 'ano_modelo_carro', 'cor', 'tipo_carro' 
              ,'tipo_cambio', 'combustível', 'tipo_falha', 'categoria_falha','nivel_conserto']

disc_columns=["ano_registro","lugares","portas","dia_aviso",'tamanho_motor','mes_aviso','ano_aviso']
date_columns=["data_conserto","data_sinistro",]
cont_columns=['preço','tempo_conserto',"milhas_carro","custo_conserto"]
label_columns=["Label"]


loader=Loader('vehicle_claims_labeled.csv',cat_columns=cat_columns,date_columns=date_columns,cont_columns=cont_columns,disc_columns=disc_columns,label_columns=label_columns)
loader.rename_columns(columns_map)



KNN=Model(KNeighborsClassifier,params={},supervised=True,run_scaled=False)
KNN.grid_search_params(n_neighbors=range(3,25,2),weights=["uniform","distance"])
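# liblinear supports both the l1 and l2 penalties searched below (the default lbfgs solver rejects l1)
LR=Model(LogisticRegression,params={"solver":"liblinear"},supervised=True,run_scaled=False)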
LR.random_grid_search_params(C=np.logspace(-4,4,20),penalty=["l1","l2"])
NB=Model(GaussianNB,params={},supervised=True,run_scaled=True)
# GaussianNB has no n_features_in hyperparameter, so only var_smoothing is searched
NB.random_grid_search_params(var_smoothing=np.logspace(-9,-1,20))
models=[KNN,LR,NB]
ml_pipeline=orchestrator(models,loader,Preprocessor,MLPreprocessing,Trainer,Evaluation,ModelTunning,Selector)



In [None]:
ml_pipeline.run_pipeline(cv=3,random_n_iter=3,grid_max_iter=3)