# Naive Bayes - Trabalho

## Questão 1

Implemente um classifacor Naive Bayes para o problema de predizer a qualidade de um carro. Para este fim, utilizaremos um conjunto de dados referente a qualidade de carros, disponível no [UCI](https://archive.ics.uci.edu/ml/datasets/car+evaluation). Este dataset de carros possui as seguintes features e classe:

** Attributos **
1. buying: vhigh, high, med, low
2. maint: vhigh, high, med, low
3. doors: 2, 3, 4, 5, more
4. persons: 2, 4, more
5. lug_boot: small, med, big
6. safety: low, med, high

** Classes **
1. unacc, acc, good, vgood

In [293]:
import math
import random
import pandas as pd
import numpy as np

In [383]:
def _split_dataset(dataset, splitRatio):
    msk = np.random.rand(len(df)) < splitRatio
    train = df[msk]
    test = df[~msk]
    return train, test

def _mean(numbers):
    return sum(numbers) / float(len(numbers))
 
def _stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x - avg, 2) for x in numbers]) / float(len(numbers) - 1)
    return math.sqrt(variance)

class NavieBayes:
    def __init__(self, dataset):
        self.dataset = dataset
        unique_data = [df[label].unique() for label in labels]
        legend = []

        for i, label in enumerate(labels):
            _dict = {}
            for j, cat in enumerate(unique_data[i]):
                _dict[cat] = j
            df[label] = df[label].map(_dict)
            legend.append(_dict)

    def _calculate_probability(self, x, mean, stdev):
        if stdev == 0:
            stdev = 10000
        exponent = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
        return (1 / (math.sqrt(2 * math.pi) * math.pow(stdev, 2))) * exponent

    def _summarize(self, dataset):
        summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
        del summaries[-1]
        return summaries
    
    def _predict_for_row(self, row_input):
        self._calculate_class_probabilities(row_input)
        chosen_label, chosen_prob = None, -1
        for class_code, probability in self.probabilities.items():
            if chosen_label is None or probability > chosen_prob:
                chosen_prob = probability
                chosen_label = class_code
        return chosen_label
     
    def _separate_by_class(self):
        separated = {}
        for _, row in self.dataset.iterrows():
            if row[-1] not in separated:
                separated[row[-1]] = []
            separated[row[-1]].append(row)
        
        self.separated = separated

    def _calculate_class_probabilities(self, row_input):
        probabilities = {}
        for class_value, class_summaries in self.summaries.items():
            probabilities[class_value] = 1
            for i, class_summary in enumerate(class_summaries):
                mean, stdev = class_summary
                x = row_input[i]
                probabilities[class_value] *= self._calculate_probability(x, mean, stdev)
        self.probabilities = probabilities

    def _summarize_by_class(self):
        self._separate_by_class()
        summaries = {}
        for key, data in self.separated.items():
            summaries[key] = self._summarize(data)
        self.summaries = summaries
    
    def predict(self):
        predictions = []
        for _, test_row in self.test_set.iterrows():
            result = self._predict_for_row(test_row)
            predictions.append(result)
        self.predictions = predictions
        return self.predictions
    
    def split(self, splitRatio):
        train, test = _split_dataset(self.dataset, splitRatio)
        self.test_set = test
        self.train_set = train
    
    def fit(self):
        self._summarize_by_class()
        
    def get_accuracy(self):
        correct = 0
        for i in range(len(self.test_set)):
            if self.test_set.iloc[i][-1] == self.predictions[i]:
                correct += 1
        return (correct/float(len(self.test_set)))


In [384]:
labels = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "label"]
df = pd.read_csv('carData.csv', sep=',', names=labels)

In [385]:
nb = NavieBayes(df)
nb.split(0.7)
nb.fit()

my_predictions = nb.predict()

In [386]:
nb.summaries


{0: [(1.350413223140496, 1.1180986436047557),
  (1.3669421487603306, 1.127547271846503),
  (1.4545454545454546, 1.1271519258821532),
  (0.7900826446280992, 0.8358593601329547),
  (0.9322314049586777, 0.8197535262768612),
  (0.7528925619834711, 0.8027634632071083)],
 1: [(1.5755208333333333, 1.0419527457466646),
  (1.5911458333333333, 1.0481986236664393),
  (1.5859375, 1.0949254686892318),
  (1.484375, 0.5004077971479614),
  (1.1015625, 0.799867910929502),
  (1.53125, 0.4996735226553633)],
 2: [(2.6, 0.49371044145328763),
  (2.2, 0.7541551564499178),
  (1.7692307692307692, 1.0572551544156505),
  (1.5384615384615385, 0.5023980952928128),
  (1.6153846153846154, 0.4902903378454599),
  (2.0, 0.0)],
 3: [(2.6666666666666665, 0.4748580799338168),
  (2.6666666666666665, 0.47485807993381696),
  (1.565217391304348, 1.104512946553756),
  (1.4782608695652173, 0.5031867754087856),
  (1.0434782608695652, 0.8123093913741108),
  (1.434782608695652, 0.4993602044724244)]}

### Acurácia

In [387]:
my_accuracy = nb.get_accuracy() * 100
my_accuracy

67.8030303030303

## Questão 2
Crie uma versão de sua implementação usando as funções disponíveis na biblioteca SciKitLearn para o Naive Bayes ([veja aqui](http://scikit-learn.org/stable/modules/naive_bayes.html)) 

In [388]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, accuracy_score

In [389]:
train, test = nb.train_set, nb.test_set
nb_model = GaussianNB()

train_values, train_labels = [None] * len(train), [None] * len(train)
test_values, test_labels = [None] * len(test), [None] * len(test)

for i in range(len(train)):
    row = train.iloc[i]
    train_values[i] = row[:-1]
    train_labels[i] = row[-1]

nb_model.fit(train_values, train_labels)

for i in range(len(test)):
    row = test.iloc[i]
    test_values[i] = row[:-1]
    test_labels[i] = row[-1]

nb_model.fit(train_values, train_labels)

their_predictions = nb_model.predict(test_values)

their_accuracy = accuracy_score(test_labels, their_predictions) * 100

### Acurácia

In [390]:
their_accuracy

68.56060606060606

## Questão 3

Analise a acurácia dos dois algoritmos e discuta a sua solução.

### Diferência de acurácia

Atualemente a diferença da acurácia é de:

In [391]:
'{}%'.format(abs(their_accuracy - my_accuracy))

'0.7575757575757649%'

Os valores se manteram dentro do intervalo de 0.5% a 7%. Sendo que ambas as acurácias se manteram no intervalo de 68% a 72%, aproximadamente.

### Métricas

Utilizando a função `classification_report` da biblioteca _sklearn_ para calcular as métricas relacionadas:

#### Implementação própia

In [392]:
print(classification_report(test_labels, my_predictions))

             precision    recall  f1-score   support

          0       1.00      0.64      0.78       360
          1       0.56      0.81      0.66       125
          2       0.00      0.00      0.00        18
          3       0.21      1.00      0.35        25

avg / total       0.83      0.68      0.71       528



  'precision', 'predicted', average, warn_for)


### Implementação do SKLearn

In [393]:
print(classification_report(test_labels, their_predictions))

             precision    recall  f1-score   support

          0       0.84      0.88      0.86       360
          1       0.67      0.16      0.26       125
          2       0.17      1.00      0.28        18
          3       0.60      0.24      0.34        25

avg / total       0.76      0.69      0.67       528

