# Model Quality Assessment

## Part 1: Multinomial Naive Bayes Classifier on Mushroom Dataset

### 1. Introduction
Multinomial Naive Bayes is a variant of the Naive Bayes classification algorithm that works with discrete features. It is traditionally used for text classification tasks such as spam filtering and sentiment analysis, but it can also be applied to datasets with categorical features, such as the Mushroom dataset, whose features describe mushroom species classified as either edible or poisonous.

### 2. About the Mushroom Dataset
The Mushroom dataset provides information about many mushroom species from the Agaricus and Lepiota families. These species are identified based on a range of characteristics, and the goal is to classify them as either edible (e) or poisonous (p).
The dataset consists of more than 8000 samples covering 23 species of gilled mushrooms. Each sample is described by 22 categorical attributes, such as cap shape, cap surface, odor, gill attachment, and stalk shape.

**Attribute Information:**

cap-shape: bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s

cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s

cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y

bruises: bruises=t,no=f

odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s

gill-attachment: attached=a,descending=d,free=f,notched=n

gill-spacing: close=c,crowded=w,distant=d

gill-size: broad=b,narrow=n

gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y

stalk-shape: enlarging=e,tapering=t

stalk-root: bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,**missing=?**

stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s

stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s

stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y

stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y

veil-type: partial=p,universal=u

veil-color: brown=n,orange=o,white=w,yellow=y

ring-number: none=n,one=o,two=t

ring-type: cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z

spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y

population: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y

habitat: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d

### 3. Model Overview
The Multinomial Naive Bayes algorithm assumes that the features are conditionally independent given the class. This simplifies computation, but it may not reflect real-world relationships, because some features in the Mushroom dataset could in fact be correlated. Despite this, the independence assumption often works well in practice: it captures the overall structure of the data without requiring much computation.
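
As a rough illustration of what the independence assumption buys, the posterior for a class can be written as a product of per-feature terms. The symbols below are assumptions introduced for this sketch: $y$ is the class (edible or poisonous), $x_1, \dots, x_n$ are the feature values of one mushroom, and $\hat{y}$ is the predicted class.

$$
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{j=1}^{n} P(x_j \mid y),
\qquad
\hat{y} = \arg\max_{y} \left[ \log P(y) + \sum_{j=1}^{n} \log P(x_j \mid y) \right]
$$

Each factor $P(x_j \mid y)$ can be estimated from simple value counts within a class, which is why training stays cheap. Working with sums of logarithms rather than products is a common way to avoid numerical underflow when there are many features.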

The model also assumes that features are categorical. For datasets with continuous features, this approach would not be suitable, and a different variant, such as Gaussian Naive Bayes, would be necessary.

The classifier is implemented using NumPy and `collections.defaultdict` for managing feature probabilities. The `train_test_split` function from the sklearn library is used to split the data.
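
The following is a minimal sketch of how a count-based classifier of this kind might be organised around `collections.defaultdict`. It is not the implementation evaluated in this report: it uses plain Python instead of NumPy, the class name `CategoricalNaiveBayes` and the add-`alpha` (Laplace) smoothing are assumptions made for illustration, and the four training rows are made-up toy examples that only borrow the odor and bruises codes from the attribute list.

```python
import math
from collections import defaultdict


class CategoricalNaiveBayes:
    """Illustrative Naive Bayes for categorical features (hypothetical class)."""

    def __init__(self, alpha=1.0):
        # Smoothing strength; alpha=1.0 is an assumed default, not from the report.
        self.alpha = alpha

    def fit(self, rows, labels):
        self.class_counts = defaultdict(int)
        # counts[(feature_index, class_label)][value] -> number of occurrences
        self.counts = defaultdict(lambda: defaultdict(int))
        # Distinct values observed for each feature, used by the smoothing term.
        self.values = defaultdict(set)
        for row, label in zip(rows, labels):
            self.class_counts[label] += 1
            for j, value in enumerate(row):
                self.counts[(j, label)][value] += 1
                self.values[j].add(value)
        self.total = len(labels)
        return self

    def log_posteriors(self, row):
        """Unnormalised log posterior for every class."""
        scores = {}
        for label, n_label in self.class_counts.items():
            score = math.log(n_label / self.total)  # log prior
            for j, value in enumerate(row):
                seen = self.counts[(j, label)].get(value, 0)
                n_values = len(self.values[j])
                # Smoothed estimate of P(value | class); never exactly zero.
                score += math.log(
                    (seen + self.alpha) / (n_label + self.alpha * n_values)
                )
            scores[label] = score
        return scores

    def predict(self, row):
        scores = self.log_posteriors(row)
        return max(scores, key=scores.get)


# Made-up toy rows: (odor, bruises). These are NOT records from the dataset.
toy_rows = [("a", "t"), ("l", "t"), ("f", "f"), ("p", "f")]
toy_labels = ["e", "e", "p", "p"]

model = CategoricalNaiveBayes().fit(toy_rows, toy_labels)
print(model.predict(("a", "t")))  # e
print(model.predict(("f", "f")))  # p
```

Smoothing matters here because a feature value that never occurs with a class in the training split would otherwise contribute a zero probability and wipe out the whole product.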

### 4. Model Evaluation and Performance
The performance of the model is evaluated by the accuracy of its predictions. Accuracy is the proportion of correct predictions out of the total number of predictions. The train-test split ensures that the model is evaluated on data it did not see during training.

The classifier's average accuracy over 100 tests was 99.1%. This indicates that the model classifies mushrooms reliably from the given features.
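
The evaluation procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the script that produced the reported figure: `split_indices` is a plain-Python stand-in for sklearn's `train_test_split`, the 20% test fraction and the seed are assumed values, and `majority_baseline` is a hypothetical placeholder classifier used only to make the example runnable.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def split_indices(n_samples, test_fraction, rng):
    """Shuffle indices and cut them into disjoint train and test parts."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = max(1, int(round(n_samples * test_fraction)))
    return indices[n_test:], indices[:n_test]


def mean_accuracy(fit_predict, X, y, n_runs=100, test_fraction=0.2, seed=0):
    """Average test accuracy over n_runs independent random splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        train_idx, test_idx = split_indices(len(X), test_fraction, rng)
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        X_test = [X[i] for i in test_idx]
        y_test = [y[i] for i in test_idx]
        # fit_predict trains on the training part and labels the test part.
        y_pred = fit_predict(X_train, y_train, X_test)
        scores.append(accuracy(y_test, y_pred))
    return sum(scores) / len(scores)


def majority_baseline(X_train, y_train, X_test):
    """Placeholder classifier: always predicts the most frequent training label."""
    most_common = max(set(y_train), key=y_train.count)
    return [most_common] * len(X_test)


# Made-up labels, only to show the call pattern.
X_demo = [[i] for i in range(10)]
y_demo = ["e"] * 8 + ["p"] * 2
print(mean_accuracy(majority_baseline, X_demo, y_demo, n_runs=100))
```

Averaging over many random splits reduces the influence of any single lucky or unlucky partition, which is presumably why the report quotes a mean over 100 tests rather than one score.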

## Part 2: Gaussian Naive Bayes Classifier on Iris Dataset

### 1. Introduction
The Gaussian Naive Bayes classifier is another version of the Naive Bayes algorithm, designed for continuous data. It assumes that, within each class, every feature follows a normal distribution.

### 2. About the Iris Dataset
The Iris dataset contains measurements for three classes of iris flowers, each consisting of 50 samples. The goal is to classify flowers into species based on provided features.

Features in the Dataset:
- Sepal length (in cm)
- Sepal width (in cm)
- Petal length (in cm)
- Petal width (in cm)

Class Labels:
- Setosa
- Versicolor
- Virginica

This dataset is suitable for the Gaussian Naive Bayes classifier because all features are continuous.

### 3. Model Overview
The model calculates probabilities using the Gaussian probability density function for each feature. The main assumption is that the features are normally distributed within each class. This simplifies the probability calculations while maintaining good classification ability for continuous data.
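
A minimal sketch of this idea is shown below: each class stores a mean and variance per feature, and prediction adds the Gaussian log-densities to the log prior. It is written in plain Python rather than NumPy and is not the implementation evaluated here. The class name `GaussianNaiveBayes`, the `var_floor` term that guards against zero variance, and the toy measurements with their `small`/`large` labels are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def gaussian_log_pdf(x, mean, var):
    """Log of the normal density N(x; mean, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


class GaussianNaiveBayes:
    """Illustrative Gaussian Naive Bayes (hypothetical class)."""

    def __init__(self, var_floor=1e-9):
        # Small constant added to every variance; an assumed safeguard
        # against division by zero when a feature is constant in a class.
        self.var_floor = var_floor

    def fit(self, rows, labels):
        grouped = defaultdict(list)
        for row, label in zip(rows, labels):
            grouped[label].append(row)
        self.params = {}  # label -> list of (mean, variance) per feature
        self.priors = {}  # label -> class prior
        for label, members in grouped.items():
            n = len(members)
            stats = []
            for column in zip(*members):  # iterate feature by feature
                mean = sum(column) / n
                var = sum((v - mean) ** 2 for v in column) / n + self.var_floor
                stats.append((mean, var))
            self.params[label] = stats
            self.priors[label] = n / len(labels)
        return self

    def predict(self, row):
        scores = {}
        for label, stats in self.params.items():
            score = math.log(self.priors[label])
            for x, (mean, var) in zip(row, stats):
                score += gaussian_log_pdf(x, mean, var)
            scores[label] = score
        return max(scores, key=scores.get)


# Made-up two-feature measurements; NOT rows from the Iris dataset.
toy_rows = [(1.4, 0.2), (1.5, 0.3), (1.3, 0.2),
            (4.5, 1.5), (4.7, 1.4), (4.4, 1.3)]
toy_labels = ["small"] * 3 + ["large"] * 3

gnb = GaussianNaiveBayes().fit(toy_rows, toy_labels)
print(gnb.predict((1.4, 0.25)))  # small
print(gnb.predict((4.6, 1.4)))   # large
```

Because only a mean and a variance are stored per class and feature, the model stays compact, but its predictions degrade when the real per-class distributions are skewed or overlap heavily.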

The model is implemented using NumPy and collections.defaultdict. The train-test split is performed using the train_test_split function from sklearn.

### 4. Model Evaluation and Performance
The performance of the Gaussian Naive Bayes model is measured by its ability to classify flowers into species. The dataset is divided into training and testing sets to evaluate the model's generalization performance.

The classifier's average accuracy over 100 tests was 95.3%.
The combination of small dataset size, overlapping feature distributions between classes, and imperfect adherence to the Gaussian assumption explains why the accuracy is not as high as in the previous case.