<a href="https://colab.research.google.com/github/deiveleal/data/blob/main/mestrado/ft105/classificacao/ClassificationNaiveBayesKNNEnsemble.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#FT105A - Tópico Interdisciplinar I: Introdução ao Aprendizado de Máquina
### Aluno: Deive Audieres Leal
### RA: 083423


### Tarefa 2: Naïve Bayes, k-NN e Ensembles

#### Link para o repositório: [ClassificationNaiveBayesKNNEnsemble](https://github.com/deiveleal/data/blob/main/mestrado/ft105/classificacao/ClassificationNaiveBayesKNNEnsemble.ipynb)

# Enunciado:

Escolha um pacote de software para classificação de dados que tenha implementações de algoritmos de árvores de decisão, Naïve Bayes e k-NN, e aplique estes três algoritmos ao conjunto de dados Liver Disorder:

* Utilize subamostragem aleatória com 5 repetições para cada algoritmo e apresente o erro de classificação médio de cada um (para os conjuntos de testes);
* Adote uma divisão de 70% dos dados para treinamento e 30% dos dados para teste;
* Faça a amostragem antes de iniciar o treinamento e use os mesmos dados para todos os algoritmos (em cada repetição);

Para cada repetição, monte um ensemble com os classificadores já treinados (via voto majoritário), aplique ao conjunto de testes e apresente o desempenho médio.

ATENÇÃO: Não se esqueça de apresentar no relatório os parâmetros definidos para cada algoritmo (caso existam)!

ATENÇÃO: Verifique na documentação do conjunto de dados como definir o atributo alvo, e explique o procedimento adotado no relatório.

#### Instala o repositório da UCI

In [1]:
!pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.3-py3-none-any.whl (7.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.3


#### Importação das bibliotecas

In [2]:
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn import naive_bayes as nb
from sklearn import neighbors
from ucimlrepo import fetch_ucirepo

#### Importa o conjunto de dados

#### Realiza a busca dos dados


In [3]:
liver_disorders = fetch_ucirepo(id=60)

#### Os dados são um dataframe pandas já separado em features e targets

In [4]:
X = liver_disorders.data.features.astype(str)
y = liver_disorders.data.targets.astype(str)

In [5]:
# metadata
print("Metadata:\n")
liver_disorders.metadata

Metadata:



{'uci_id': 60,
 'name': 'Liver Disorders',
 'repository_url': 'https://archive.ics.uci.edu/dataset/60/liver+disorders',
 'data_url': 'https://archive.ics.uci.edu/static/public/60/data.csv',
 'abstract': 'BUPA Medical Research Ltd. database donated by Richard S. Forsyth',
 'area': 'Health and Medicine',
 'tasks': ['Regression'],
 'characteristics': ['Multivariate'],
 'num_instances': 345,
 'num_features': 5,
 'feature_types': ['Categorical', 'Integer', 'Real'],
 'demographics': [],
 'target_col': ['drinks'],
 'index_col': None,
 'has_missing_values': 'no',
 'missing_values_symbol': None,
 'year_of_dataset_creation': 2016,
 'last_updated': 'Fri Nov 03 2023',
 'dataset_doi': '10.24432/C54G67',
 'creators': [],
 'intro_paper': None,
 'additional_info': {'summary': 'The first 5 variables are all blood tests which are thought to be sensitive to liver disorders that might arise from excessive alcohol consumption. Each line in the dataset constitutes the record of a single male individual.\n\n

In [6]:
# variable information
print("Variables: \n", liver_disorders.variables)

Variables: 
        name     role         type demographic  \
0       mcv  Feature   Continuous        None   
1   alkphos  Feature   Continuous        None   
2      sgpt  Feature   Continuous        None   
3      sgot  Feature   Continuous        None   
4   gammagt  Feature   Continuous        None   
5    drinks   Target   Continuous        None   
6  selector    Other  Categorical        None   

                                         description units missing_values  
0                            mean corpuscular volume  None             no  
1                               alkaline phosphotase  None             no  
2                           alanine aminotransferase  None             no  
3                         aspartate aminotransferase  None             no  
4                      gamma-glutamyl transpeptidase  None             no  
5  number of half-pint equivalents of alcoholic b...  None             no  
6  field created by the BUPA researchers to split...  None    

### Mostra o dataframe montado

In [7]:
df = pd.concat([X,y], axis=1)
df.head(7)

Unnamed: 0,mcv,alkphos,sgpt,sgot,gammagt,drinks
0,85,92,45,27,31,0.0
1,85,64,59,32,23,0.0
2,86,54,33,16,54,0.0
3,91,78,34,24,36,0.0
4,87,70,12,28,10,0.0
5,98,55,13,17,17,0.0
6,88,62,20,17,9,0.5


## Classificação com Árvore de Decisão

#### Primeira execução

##### Separa o dataset em 80% de treino e 20% de teste, de forma randomica e com embaralhamento, foi utilizado o valor 52 no random_state como semente, para que ao alterar o mesmo na próxima execução, os valores escolhidos para treino e teste sejam diferentes

In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3
    )

In [9]:
X_train

Unnamed: 0,mcv,alkphos,sgpt,sgot,gammagt
94,89,62,42,30,20
85,94,58,21,18,26
97,89,89,23,18,104
284,96,70,21,26,21
197,88,74,31,25,15
...,...,...,...,...,...
334,91,138,45,21,48
59,90,63,24,24,24
126,92,73,24,21,48
5,98,55,13,17,17


In [10]:
X_test

Unnamed: 0,mcv,alkphos,sgpt,sgot,gammagt
146,96,70,70,26,36
339,87,56,52,43,55
231,87,57,30,30,22
6,88,62,20,17,9
249,90,63,45,24,85
...,...,...,...,...,...
166,92,79,70,32,84
330,95,73,20,25,225
281,89,48,32,22,14
106,91,80,37,23,27


In [11]:
y_train

Unnamed: 0,drinks
94,3.0
85,2.0
97,3.0
284,4.0
197,0.5
...,...
334,10.0
59,0.5
126,4.0
5,0.0


In [12]:
y_test

Unnamed: 0,drinks
146,6.0
339,10.0
231,0.5
6,0.5
249,1.0
...,...
166,7.0
330,8.0
281,4.0
106,4.0


##### Cria função com o modelo de árvore de decisão. Foram usados os parâmetros padrões com exceção do criterio que passei a utilizar o 'log_loss' como ganho da informação ao invés do padrão gini. É retornado a acurácia do modelo ao final da execução.

In [13]:
def tree_classifier(X_train, X_test, y_train, y_test):
    clf = tree.DecisionTreeClassifier(criterion='log_loss')
    clf = clf.fit(X_train,y_train)
    y_pred = clf.predict(X_test)
    accuracy = metrics.accuracy_score(y_test, y_pred)
    return {"accuracy":accuracy, "prediction":y_pred}

##### Acurácia do modelo! O quanto ele acertou?

In [14]:
print("Acurácia Tree:", tree_classifier(
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    y_test=y_test
    ))

Acurácia Tree: {'accuracy': 0.11538461538461539, 'prediction': array(['8.0', '8.0', '2.0', '3.0', '6.0', '2.0', '2.0', '5.0', '2.0',
       '0.5', '3.0', '6.0', '4.0', '8.0', '0.5', '6.0', '0.5', '0.5',
       '2.0', '0.5', '2.0', '4.0', '2.0', '4.0', '0.5', '6.0', '6.0',
       '0.5', '6.0', '2.0', '0.5', '6.0', '0.5', '4.0', '0.5', '4.0',
       '0.5', '0.5', '2.0', '3.0', '0.5', '4.0', '2.0', '3.0', '1.0',
       '2.0', '20.0', '2.0', '0.5', '6.0', '0.5', '2.0', '0.5', '10.0',
       '0.5', '6.0', '8.0', '10.0', '4.0', '0.5', '2.0', '4.0', '6.0',
       '5.0', '12.0', '8.0', '3.0', '0.5', '0.5', '5.0', '0.5', '2.0',
       '6.0', '0.5', '4.0', '4.0', '2.0', '8.0', '2.0', '6.0', '4.0',
       '5.0', '7.0', '0.5', '0.5', '4.0', '0.0', '1.0', '2.0', '0.5',
       '0.5', '0.5', '4.0', '4.0', '8.0', '4.0', '4.0', '0.5', '6.0',
       '2.0', '4.0', '2.0', '0.5', '0.5'], dtype=object)}


In [15]:
def knn_classifier(X_train, X_test, y_train, y_test):
    clf = neighbors.KNeighborsClassifier()
    clf = clf.fit(X_train,y_train)
    y_pred = clf.predict(X_test)
    accuracy = metrics.accuracy_score(y_test, y_pred)
    return {"accuracy":accuracy, "prediction":y_pred}

In [16]:
print("Acurácia KNN:", knn_classifier(
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    y_test=y_test
    ))

Acurácia KNN: {'accuracy': 0.21153846153846154, 'prediction': array(['0.0', '1.0', '2.0', '0.5', '10.0', '2.0', '0.5', '3.0', '1.0',
       '0.5', '0.5', '0.5', '0.5', '2.0', '0.5', '0.5', '0.5', '0.5',
       '0.5', '0.5', '3.0', '0.5', '4.0', '0.5', '0.5', '10.0', '20.0',
       '1.0', '0.5', '0.5', '8.0', '2.0', '0.5', '0.5', '0.5', '8.0',
       '0.0', '0.5', '0.5', '0.5', '0.5', '0.5', '1.0', '3.0', '1.0',
       '4.0', '0.5', '0.5', '2.0', '0.5', '0.5', '2.0', '0.5', '0.5',
       '4.0', '2.0', '0.5', '0.5', '4.0', '0.5', '0.5', '0.5', '6.0',
       '4.0', '0.5', '0.5', '1.0', '0.5', '0.5', '2.0', '0.5', '2.0',
       '5.0', '0.5', '3.0', '0.5', '0.5', '0.5', '0.5', '4.0', '1.0',
       '4.0', '0.5', '0.5', '0.5', '0.5', '0.5', '0.5', '0.5', '0.5',
       '6.0', '0.5', '1.0', '0.5', '3.0', '0.5', '0.5', '0.5', '0.5',
       '6.0', '2.0', '0.5', '2.0', '0.5'], dtype=object)}


  return self._fit(X, y)


In [17]:
def nb_classifier(X_train, X_test, y_train, y_test):
    clf = nb.GaussianNB()
    clf = clf.fit(X_train,y_train)
    y_pred = clf.predict(X_test)
    accuracy = metrics.accuracy_score(y_test, y_pred)
    return {"accuracy":accuracy, "prediction":y_pred}

In [18]:
print("Acurácia NB:", nb_classifier(
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    y_test=y_test
    ))

Acurácia NB: {'accuracy': 0.19230769230769232, 'prediction': array(['0.5', '0.5', '0.5', '3.0', '2.0', '0.5', '5.0', '3.0', '0.5',
       '0.5', '0.5', '0.5', '0.5', '3.0', '0.5', '6.0', '5.0', '6.0',
       '0.5', '7.0', '3.0', '0.5', '0.5', '0.5', '0.5', '5.0', '7.0',
       '2.0', '0.5', '0.5', '8.0', '0.5', '0.5', '0.5', '1.0', '2.0',
       '0.5', '0.5', '1.0', '3.0', '0.5', '3.0', '3.0', '3.0', '3.0',
       '3.0', '20.0', '3.0', '0.5', '0.5', '2.0', '3.0', '3.0', '0.5',
       '0.5', '0.5', '0.5', '0.5', '0.5', '5.0', '10.0', '0.5', '0.5',
       '0.5', '20.0', '6.0', '3.0', '0.5', '1.0', '5.0', '0.5', '0.5',
       '16.0', '3.0', '3.0', '3.0', '0.5', '1.0', '3.0', '5.0', '3.0',
       '2.0', '0.5', '0.5', '3.0', '1.0', '0.5', '1.0', '0.5', '5.0',
       '7.0', '8.0', '0.5', '0.5', '6.0', '0.5', '8.0', '5.0', '0.5',
       '8.0', '2.0', '1.0', '0.5', '0.5'], dtype='<U4')}


  y = column_or_1d(y, warn=True)


In [57]:
ensemble_source = {}
for execution in range(0, 5):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    ensemble_source[f"tree_accuracy_{execution}"] = tree_classifier(X_train, X_test, y_train, y_test)["accuracy"]
    ensemble_source[f"tree_classification_{execution}"] = tree_classifier(X_train, X_test, y_train, y_test)["prediction"]
    ensemble_source[f"knn_accuracy_{execution}"] = tree_classifier(X_train, X_test, y_train, y_test)["accuracy"]
    ensemble_source[f"knn_classification_{execution}"] = tree_classifier(X_train, X_test, y_train, y_test)["prediction"]
    ensemble_source[f"nb_accuracy_{execution}"] = tree_classifier(X_train, X_test, y_train, y_test)["accuracy"]
    ensemble_source[f"nb_classification_{execution}"] = tree_classifier(X_train, X_test, y_train, y_test)["prediction"]

In [58]:
df_ensemble = pd.DataFrame(ensemble_source)

In [59]:
df_ensemble

Unnamed: 0,tree_accuracy_0,tree_classification_0,knn_accuracy_0,knn_classification_0,nb_accuracy_0,nb_classification_0,tree_accuracy_1,tree_classification_1,knn_accuracy_1,knn_classification_1,...,knn_accuracy_3,knn_classification_3,nb_accuracy_3,nb_classification_3,tree_accuracy_4,tree_classification_4,knn_accuracy_4,knn_classification_4,nb_accuracy_4,nb_classification_4
0,0.201923,1.0,0.182692,1.0,0.173077,1.0,0.173077,0.0,0.163462,0.0,...,0.105769,15.0,0.134615,3.0,0.25,4.0,0.221154,4.0,0.192308,4.0
1,0.201923,2.0,0.182692,7.0,0.173077,0.5,0.173077,0.5,0.163462,0.5,...,0.105769,8.0,0.134615,8.0,0.25,6.0,0.221154,6.0,0.192308,6.0
2,0.201923,6.0,0.182692,6.0,0.173077,6.0,0.173077,2.0,0.163462,2.0,...,0.105769,0.5,0.134615,0.5,0.25,8.0,0.221154,8.0,0.192308,8.0
3,0.201923,4.0,0.182692,4.0,0.173077,4.0,0.173077,0.5,0.163462,0.5,...,0.105769,3.0,0.134615,3.0,0.25,4.0,0.221154,4.0,0.192308,4.0
4,0.201923,1.0,0.182692,1.0,0.173077,1.0,0.173077,0.5,0.163462,0.5,...,0.105769,0.5,0.134615,0.5,0.25,8.0,0.221154,8.0,0.192308,8.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99,0.201923,4.0,0.182692,4.0,0.173077,4.0,0.173077,4.0,0.163462,6.0,...,0.105769,0.5,0.134615,0.5,0.25,0.5,0.221154,0.5,0.192308,0.5
100,0.201923,4.0,0.182692,4.0,0.173077,4.0,0.173077,4.0,0.163462,4.0,...,0.105769,10.0,0.134615,10.0,0.25,0.0,0.221154,0.0,0.192308,0.0
101,0.201923,5.0,0.182692,5.0,0.173077,5.0,0.173077,8.0,0.163462,8.0,...,0.105769,0.5,0.134615,0.5,0.25,8.0,0.221154,8.0,0.192308,8.0
102,0.201923,4.0,0.182692,2.0,0.173077,2.0,0.173077,0.5,0.163462,0.5,...,0.105769,4.0,0.134615,4.0,0.25,0.5,0.221154,0.5,0.192308,0.5


In [26]:
df_ensemble['tree_accuracy'] = tree_accuracy
df_ensemble['knn_accuracy'] = knn_accuracy
df_ensemble['nb_accuracy'] = nb_accuracy

In [27]:
df_ensemble

Unnamed: 0,tree_accuracy,knn_accuracy,nb_accuracy
0,0.240385,0.173077,0.182692


In [43]:
df_ensemble

AttributeError: 'DataFrame' object has no attribute 'id'

In [34]:
df_ensemble.replace(df_ensemble.min, 0)

TypeError: Expecting 'to_replace' to be either a scalar, array-like, dict or None, got invalid type 'method'