## Implementation from scratch of a simple Naive Bayes classifier

Problem: given a class $K$ and a set of observed data about an item $x$, estimate

$$
P(K=1 \mid x ),
$$

where $K=1$ means that $x$ is an item of the class $K$.

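Note that, by Bayes' theorem,
$$
P(K = 1 \mid x) = \frac{P(K=1)P(x \mid K=1)}{P(x)} \propto P(K=1)P(x \mid K=1),
$$
where the denominator $P(x)$ can be dropped because it does not depend on the class, so it does not affect which class scores highest.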
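
The "naive" part of the classifier is the assumption that the individual features $x_1, \dots, x_n$ that make up $x$ are conditionally independent given the class. This assumption is standard for Naive Bayes but is not spelled out in the text here, so the following is a sketch of how the score is usually decomposed:

$$
P(x \mid K=1) = \prod_{i=1}^{n} P(x_i \mid K=1)
\quad\Longrightarrow\quad
P(K=1 \mid x) \propto P(K=1)\prod_{i=1}^{n} P(x_i \mid K=1).
$$

In practice the product is often replaced by a sum of logarithms, $\log P(K=1) + \sum_{i} \log P(x_i \mid K=1)$, which ranks the classes identically and avoids numerical underflow when there are many features.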

In [1]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

## Exercise
The implementation contains some errors and biases that make the algorithm ineffective. Review the code and fix the errors. Pay particular attention to:
- Smoothing: how to handle a probability of 0 for a single observation so that it does not cancel out the others
- Dataset imbalance: how can we train the algorithm to estimate the probabilities without overestimating the largest class?
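
A minimal sketch of the two points may help before reading the code. All counts and names here (`smoothed`, `alpha`, `n_values`, the class sizes) are made up for illustration and do not come from the zoo dataset.

```python
from math import prod

def smoothed(count, n, alpha=1.0, n_values=2):
    """Additive (Laplace) smoothing for a feature with n_values possible values."""
    return (count + alpha) / (n + alpha * n_values)

# Hypothetical counts: how many of the 5 items in a class show each of 3 features.
counts, n = [5, 4, 0], 5

raw = prod(c / n for c in counts)              # a single zero wipes out the product
smooth = prod(smoothed(c, n) for c in counts)  # every factor stays strictly positive

print(raw)     # 0.0
print(smooth)

# Hypothetical class sizes for an imbalanced dataset.
sizes = {"A": 40, "B": 5}
empirical_prior = {k: v / sum(sizes.values()) for k, v in sizes.items()}
uniform_prior = {k: 1 / len(sizes) for k in sizes}
print(empirical_prior, uniform_prior)
```

Comparing classes under a uniform prior is only one possible way to keep the largest class from dominating; whether it is appropriate depends on how the classifier will be used.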

In [161]:
T = pd.read_csv('./data/zoo/zoo.csv')
K = pd.read_csv('./data/zoo/class.csv')

In [162]:
T.head()

Unnamed: 0,animal_name,hair,feathers,eggs,milk,airborne,aquatic,predator,toothed,backbone,breathes,venomous,fins,legs,tail,domestic,catsize,class_type
0,aardvark,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1
1,antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1
2,bass,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4
3,bear,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1
4,boar,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1


In [84]:
T_bool = T[[x for x in T.columns if x not in ['legs', 'animal_name']]]

In [85]:
legs_df = pd.get_dummies(T.legs, prefix='legs')

In [86]:
D = T_bool.join(legs_df)

In [139]:
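def likelihood(class_name, feature):
    # Rows of D that belong to the requested class (looked up by name in K)
    in_class = D[D.class_type == classname(class_name)]
    # Smoothed estimate of P(feature | class): +1 in numerator and denominator
    return (in_class[feature].sum() + 1) / (in_class.shape[0] + 1)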

In [140]:
class_dist = T.groupby('class_type').count()['animal_name']

In [141]:
def prior(x):
    return (class_dist / class_dist.sum()).loc[x]

classname = lambda name: K[K['Class_Type'] == name].Class_Number.values[0]
getname = lambda name: K[K['Class_Number'] == name].Class_Type.values[0]

In [142]:
likelihood('Mammal', 'legs_4')

0.7619047619047619
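
This value is consistent with 31 of the 41 mammals having four legs, since (31 + 1) / (41 + 1) = 32 / 42. The count 31 is inferred from the output and from the class table `K`, not read from the data. The sketch below compares the smoothing as written with the usual add-one (Laplace) form for a binary feature, which adds one pseudo-count per possible value and therefore 2 to the denominator.

```python
# Inferred, not read from the data: 31 of 41 mammals with legs_4 == 1.
count, n = 31, 41

as_written = (count + 1) / (n + 1)   # smoothing used by likelihood()
laplace = (count + 1) / (n + 2)      # add-one smoothing for a binary feature

# Probability of the complementary event (legs_4 == 0) under each scheme.
as_written_absent = (n - count + 1) / (n + 1)
laplace_absent = (n - count + 1) / (n + 2)

print(as_written)                    # 0.7619047619047619
print(as_written + as_written_absent)  # exceeds 1
print(laplace + laplace_absent)        # sums to 1 (up to rounding)
```

With `+ 1` in the denominator the estimates for "present" and "absent" do not sum to 1, which is one sign that the smoothing term may be among the errors the exercise asks about.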

In [143]:
K

Unnamed: 0,Class_Number,Number_Of_Animal_Species_In_Class,Class_Type,Animal_Names
0,1,41,Mammal,"aardvark, antelope, bear, boar, buffalo, calf,..."
1,2,20,Bird,"chicken, crow, dove, duck, flamingo, gull, haw..."
2,3,5,Reptile,"pitviper, seasnake, slowworm, tortoise, tuatara"
3,4,13,Fish,"bass, carp, catfish, chub, dogfish, haddock, h..."
4,5,4,Amphibian,"frog, frog, newt, toad"
5,6,8,Bug,"flea, gnat, honeybee, housefly, ladybird, moth..."
6,7,10,Invertebrate,"clam, crab, crayfish, lobster, octopus, scorpi..."


## Pre-processing

In [144]:
def extract_features(animal_index, df):
    features = []
    row = df.loc[animal_index]
    for k, v in row.items():
        if v > 0:
            features.append(k)
    return features

def get_class(animal_index, df):
    ctype = df.loc[animal_index].class_type
    return getname(ctype)

In [154]:
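def predict(animal_i, klass):
    # Unnormalized posterior: prior(klass) * product of the feature likelihoods.
    # NOTE: features are read from row 0 of D; animal_i is not used.
    features = extract_features(animal_index=0, df=D)
    p = 1.0
    for feature in features:
        p = p * likelihood(klass, feature)
    return prior(classname(klass)) * p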

In [155]:
def read_prediction(animal):
    all_classes = np.array([predict(animal, k) for k in K.Class_Type.values])
    return K.Class_Type.values[all_classes.argmax()]

In [148]:
for animal in T.index.values:
    print(T.loc[animal].animal_name, '\t', get_class(animal, T), '\t', read_prediction(animal))

aardvark 	 Mammal 	 Mammal
antelope 	 Mammal 	 Mammal
bass 	 Fish 	 Mammal
bear 	 Mammal 	 Mammal
boar 	 Mammal 	 Mammal
buffalo 	 Mammal 	 Mammal
calf 	 Mammal 	 Mammal
carp 	 Fish 	 Mammal
catfish 	 Fish 	 Mammal
cavy 	 Mammal 	 Mammal
cheetah 	 Mammal 	 Mammal
chicken 	 Bird 	 Mammal
chub 	 Fish 	 Mammal
clam 	 Invertebrate 	 Mammal
crab 	 Invertebrate 	 Mammal
crayfish 	 Invertebrate 	 Mammal
crow 	 Bird 	 Mammal
deer 	 Mammal 	 Mammal
dogfish 	 Fish 	 Mammal
dolphin 	 Mammal 	 Mammal
dove 	 Bird 	 Mammal
duck 	 Bird 	 Mammal
elephant 	 Mammal 	 Mammal
flamingo 	 Bird 	 Mammal
flea 	 Bug 	 Mammal
frog 	 Amphibian 	 Mammal
frog 	 Amphibian 	 Mammal
fruitbat 	 Mammal 	 Mammal
giraffe 	 Mammal 	 Mammal
girl 	 Mammal 	 Mammal
gnat 	 Bug 	 Mammal
goat 	 Mammal 	 Mammal
gorilla 	 Mammal 	 Mammal
gull 	 Bird 	 Mammal
haddock 	 Fish 	 Mammal
hamster 	 Mammal 	 Mammal
hare 	 Mammal 	 Mammal
hawk 	 Bird 	 Mammal
herring 	 Fish 	 Mammal
honeybee 	 Bug 	 Mammal
housefly 	 Bug 	 Mammal
kiwi 	 B
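
Every animal in the output is labelled `Mammal`. That matches what `predict` does as written: it calls `extract_features(animal_index=0, df=D)`, so the features always come from row 0 (aardvark) and the argument `animal_i` has no effect. In addition, `extract_features` returns every column with a positive value, which includes the label column `class_type` itself. The following is a minimal sketch of one possible correction, run on a tiny made-up table rather than the zoo data. The toy `D` and the names `FEATURES`, `log_likelihood` and `predict_class` are hypothetical. The sketch swaps in a Bernoulli-style variant (absent features contribute too) and sums logarithms instead of multiplying probabilities.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's D: binary features plus a class_type label.
D = pd.DataFrame({
    "hair":       [1, 1, 0, 0, 0],
    "feathers":   [0, 0, 1, 1, 0],
    "fins":       [0, 0, 0, 0, 1],
    "class_type": [1, 1, 2, 2, 4],
})
FEATURES = [c for c in D.columns if c != "class_type"]  # never use the label as a feature

def log_likelihood(row, klass, alpha=1.0):
    """Sum of log P(x_i | class) over all binary features, with add-alpha smoothing."""
    subset = D[D.class_type == klass]
    p_one = (subset[FEATURES].sum() + alpha) / (len(subset) + 2 * alpha)
    x = row[FEATURES].astype(float)
    return float((x * np.log(p_one) + (1 - x) * np.log(1 - p_one)).sum())

def predict_class(animal_i):
    """Return the class_type with the highest log posterior for row animal_i."""
    row = D.loc[animal_i]  # uses animal_i, not a fixed row
    priors = D.class_type.value_counts(normalize=True)
    scores = {k: np.log(priors[k]) + log_likelihood(row, k) for k in priors.index}
    return int(max(scores, key=scores.get))

print([predict_class(i) for i in D.index])
```

On the real data the same idea would apply to the notebook's `D`, with the class numbers mapped back to names through `getname`.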

In [158]:
pd.get_dummies(T.legs, prefix='legs')

Unnamed: 0,legs_0,legs_2,legs_4,legs_5,legs_6,legs_8
0,0,0,1,0,0,0
1,0,0,1,0,0,0
2,1,0,0,0,0,0
3,0,0,1,0,0,0
4,0,0,1,0,0,0
...,...,...,...,...,...,...
96,0,1,0,0,0,0
97,0,0,0,0,1,0
98,0,0,1,0,0,0
99,1,0,0,0,0,0


## Train and test

In [159]:
from sklearn.model_selection import train_test_split

In [160]:
X_train, X_test, y_train, y_test = train_test_split(T[T.columns[:-1]], T.class_type, test_size=0.2, random_state=42)
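
With classes as small as four or five animals, a plain random split can leave a class out of the training or the test set. One option, sketched here on made-up labels rather than the zoo data, is the `stratify` argument of `train_test_split`, which keeps the class proportions roughly equal in both parts. The arrays `X` and `y` and the class sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels (20/10/5) and a dummy one-column feature matrix.
y = np.array([1] * 20 + [2] * 10 + [3] * 5)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Class counts in each part of the split.
print(np.unique(y_tr, return_counts=True))
print(np.unique(y_te, return_counts=True))
```

The priors and likelihoods would then be estimated from the training rows only and the predictions checked against the held-out test rows. Stratification needs at least two members in every class, which the class table `K` suggests is the case here.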