# ITESM - MCC - Intelligent Systems
## Exam 3 - Exercise 6

### Carlos Eduardo Hernandez Rincon
### Student ID: A01181616

#### May 15th 2019


## Instructions

15p 6.- Given the data set “wineTestModel.csv”, construct a Jupyter notebook (kernel Python or R) to determine, for each of its 6 instances, the cultivar it comes from. For that, “wine.csv” is provided, which contains 172 records of 13 chemical variables and a cultivar class. The chemical variables determine the cultivar each record comes from.

You will use Gaussian Naive Bayes procedure as shown in class. That is, you will use the Bayes’ Theorem and the Normal distribution of probabilities shown below.
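
For reference, the standard forms of Bayes' Theorem under the naive independence assumption and of the normal density are given here. The symbols are chosen for illustration: $C$ is the cultivar class, $x_i$ is the value of the $i$-th chemical variable, and $\mu$ and $\sigma$ are the mean and standard deviation of that variable within class $C$.

$$
P(C \mid x_1, \dots, x_n) = \frac{P(C)\,\prod_{i=1}^{n} P(x_i \mid C)}{P(x_1, \dots, x_n)}
$$

$$
P(x_i \mid C) = \frac{1}{\sigma \sqrt{2\pi}} \; e^{-\frac{1}{2}\left(\frac{x_i - \mu}{\sigma}\right)^{2}}
$$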

where π=3.14159 ... and e=2.71828

It is prohibited to use any library to compute the Gaussian Naive Bayes.


## Before running

Make sure that your Jupyter can access the following libraries:
1. numpy==1.16.3
2. pandas==0.24.2


## 1. Begin by importing all needed libraries


In [1]:
import pandas as pd
import math


## 2. Define the location for the training and test data sets


In [2]:
TRAINING_DATASET_PATH = "wine.csv"
TESTING_DATASET_PATH = "wineTestModel.csv"


## 3. Define the functions needed to calculate Gaussian distribution and Naïve Bayes Probability
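In `calculate_normal_probability_distribution` we evaluate the normal probability density at `x`, using the truncated constants given in the instructions:
1. pi = 3.14159
2. euler = 2.71828

In `calculate_gaussian_naive_bayes` we take the likelihood of each feature given the label, P(Feature<sub>i</sub> | Label), along with the label prior P(Label), and multiply them together. The code names these arguments `features_posteriors` and `label_posterior`. The result is the numerator of Bayes' Theorem, which is proportional to the posterior P(Label | Features):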

> P(Feature<sub>1</sub> | Label) * ... * P(Feature<sub>n</sub> | Label) * P(Label)
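
The product above is not normalized, so the scores for the different labels do not sum to 1. This does not matter for picking the most likely label, but if actual probabilities are wanted, each score can be divided by the sum of all scores. A minimal sketch, where the `scores` values are made-up numbers and `normalize_scores` is a hypothetical helper that is not part of this notebook:

```python
def normalize_scores(scores):
    """Divide each unnormalized score by the total so the results sum to 1."""
    total = sum(scores)
    return [score / total for score in scores]


# Hypothetical unnormalized scores for cultivars 1, 2 and 3
scores = [0.02, 0.005, 0.0001]
probabilities = normalize_scores(scores)

# The largest score is still the largest after normalizing
best_label = probabilities.index(max(probabilities)) + 1
print(probabilities, best_label)
```

Normalizing changes the scale of the scores but not their order, so the selected label is the same either way.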


In [3]:
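def calculate_normal_probability_distribution(mean: float, stdev: float, x: float) -> float:
    """Evaluate the normal density with the given mean and stdev at x."""
    # Truncated constants, as required by the exam instructions
    pi = 3.14159
    euler = 2.71828

    # Normalization factor: 1 / (stdev * sqrt(2 * pi))
    first_part = 1 / (stdev * math.sqrt(2 * pi))

    # Exponential term: e ^ (-0.5 * ((x - mean) / stdev) ^ 2)
    euler_power = -0.5 * math.pow((x - mean) / stdev, 2)
    second_part = math.pow(euler, euler_power)

    probability = first_part * second_part

    return probability


def calculate_gaussian_naive_bayes(features_posteriors: list, label_posterior: float) -> float:
    """Multiply the per-feature likelihoods and the label prior."""
    posterior = 1

    # Multiply all of the per-feature likelihoods in order to calculate Naive Bayes
    for fp in features_posteriors:
        posterior *= fp

    # Scale by the label prior P(Label)
    posterior = posterior * label_posterior

    return posterior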
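
As a quick sanity check, the density formula can be evaluated on the first value that appears in the printed output of section 6 (alcohol for cultivar 1). The sketch below restates the formula in a standalone function, `normal_density`, a hypothetical name used only here, and should give roughly 0.6933. The result is a density rather than a probability, so values above 1 are possible when the standard deviation is small.

```python
import math

PI = 3.14159      # truncated constants, as in the exam instructions
EULER = 2.71828


def normal_density(mean, stdev, x):
    """Normal density at x, using the truncated constants above."""
    coefficient = 1 / (stdev * math.sqrt(2 * PI))
    exponent = -0.5 * ((x - mean) / stdev) ** 2
    return coefficient * math.pow(EULER, exponent)


# Mean, standard deviation and x for alcohol given cultivar 1, copied from the printed output
alcohol_density = normal_density(mean=13.75140350877193, stdev=0.4590729734883864, x=14.06)
print(alcohol_density)

# A narrow distribution (made-up parameters) produces a density greater than 1
narrow_density = normal_density(mean=0.0, stdev=0.1, x=0.0)
print(narrow_density)
```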


## 4. Read the training data set
To make the dataset easier to work with, we apply the following steps:
1. Skip any whitespace after the delimiters in the input CSV file and load everything into a *Pandas DataFrame*.
2. Sort the whole dataset by the label column **cultivar**.
3. Separate the records by label, each into a *DataFrame* of its own.
    * Since the records are now grouped by label, we can drop the **cultivar** column for now.
4. Calculate the mean of each feature for each label.
5. Calculate the Sample Standard Deviation of each feature for each label.


In [4]:
# Read the training dataset
data = pd.read_csv(TRAINING_DATASET_PATH, skipinitialspace=True, skip_blank_lines=True)

data.sort_values("cultivar", inplace=True)

wine_possible_labels = data["cultivar"].unique()

label_1 = data.loc[data['cultivar'] == 1].drop("cultivar", axis=1)
label_2 = data.loc[data['cultivar'] == 2].drop("cultivar", axis=1)
label_3 = data.loc[data['cultivar'] == 3].drop("cultivar", axis=1)

label_1_means = label_1.mean(axis=0)
label_2_means = label_2.mean(axis=0)
label_3_means = label_3.mean(axis=0)

# We pass ddof=1 to calculate the Sample Std Dev (divides by n - 1)
label_1_stdev = label_1.std(axis=0, ddof=1)
label_2_stdev = label_2.std(axis=0, ddof=1)
label_3_stdev = label_3.std(axis=0, ddof=1)
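
The `ddof=1` argument selects the sample standard deviation, which divides the sum of squared deviations by n - 1 instead of n. The difference can be seen on a small list using only the standard library. The `values` below are illustrative and are not taken from `wine.csv`:

```python
import math
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(values)
mean = sum(values) / n
squared_deviations = sum((v - mean) ** 2 for v in values)

sample_stdev = math.sqrt(squared_deviations / (n - 1))   # equivalent to ddof=1
population_stdev = math.sqrt(squared_deviations / n)     # equivalent to ddof=0

# The standard library agrees with the manual calculation
assert math.isclose(sample_stdev, statistics.stdev(values))
assert math.isclose(population_stdev, statistics.pstdev(values))

print(sample_stdev, population_stdev)
```

The sample version is always slightly larger, and the gap shrinks as the number of records per label grows.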


## 5. Read the testing dataset and prepare it
Since the *cultivar* label is what we want to determine, we drop that column from the test data. We then prepare the variables for the calculation:
1. Since we are using Pandas DataFrames, it is useful to retrieve the list of features so that we can quickly access the information.
2. Calculate the prior P(label) for each of the labels (stored in the `posterior_1`, `posterior_2` and `posterior_3` variables).
    * This is the proportion of training records that carry that label.
3. Store the priors, means and standard deviations in parallel lists so that we can retrieve them using the same index.


In [5]:
# Read the test dataset
test_data = pd.read_csv(TESTING_DATASET_PATH, skipinitialspace=True, skip_blank_lines=True)

# Drop the cultivar label as this is the one we will calculate
test_data.drop("cultivar", axis=1, inplace=True)

# Retrieve the list of feature names that we'll iterate over during the calculation
feature_list = list(test_data.columns.values)

# The priors, e.g. P(cultivar == 1), are equal to the number of records with
# cultivar == 1, 2 or 3 divided by the total number of records
posterior_1 = len(label_1) / len(data)
posterior_2 = len(label_2) / len(data)
posterior_3 = len(label_3) / len(data)

# We'll add all of the information in symmetric lists in order to ease the calculation loop structure
cultivar_label_posteriors = [posterior_1, posterior_2, posterior_3]
means = [label_1_means, label_2_means, label_3_means]
stdevs = [label_1_stdev, label_2_stdev, label_3_stdev]
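
The same proportions can be computed for any number of labels without keeping one variable per class. A minimal sketch using `collections.Counter` on a made-up label list (the `labels` values and the `class_priors` helper are hypothetical and not part of this notebook):

```python
from collections import Counter


def class_priors(labels):
    """Return {label: fraction of records carrying that label}."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}


# Made-up labels: three records of cultivar 1, four of cultivar 2, three of cultivar 3
labels = [1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
priors = class_priors(labels)
print(priors)
```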


## 6. Begin the calculation for each of the test rows
The way we will handle this is by:
1. Iterating over each of the test rows.
2. Iterating over each of the 3 labels.
3. Calculating P(feature | cultivar == n) for each of the features.
    * We call the `calculate_normal_probability_distribution` function defined earlier.
    * We save the results in a list in order to multiply them later.
4. Calculating the (unnormalized) posterior for that set of features once we have the likelihood of every feature given the label.
    * We pass the list of feature likelihoods to `calculate_gaussian_naive_bayes`, which multiplies the whole list together with P(label).
5. Saving the value for P(label | features<sub>1...n</sub>) in a list in order to find the maximum value and thus the label.
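
The steps above can be sketched compactly on toy data. The version below sums logarithms instead of multiplying densities, a common variation that avoids underflow when many small values are multiplied. It is not what this notebook does, but it selects the same label because the logarithm is monotonic. The two-feature `stats`, `priors` and `sample` are made-up values, `log_normal_density` and `classify` are hypothetical helpers, and `math.pi` is used here instead of the truncated constant:

```python
import math


def log_normal_density(mean, stdev, x):
    """Natural log of the normal density at x."""
    return -math.log(stdev * math.sqrt(2 * math.pi)) - 0.5 * ((x - mean) / stdev) ** 2


def classify(sample, stats, priors):
    """Return the label with the highest log prior plus summed log likelihoods."""
    log_scores = {}
    for label, feature_stats in stats.items():
        score = math.log(priors[label])
        for feature, (mean, stdev) in feature_stats.items():
            score += log_normal_density(mean, stdev, sample[feature])
        log_scores[label] = score
    best_label = max(log_scores, key=log_scores.get)
    return best_label, log_scores


# Made-up per-label (mean, stdev) pairs for two features
stats = {
    1: {"alcohol": (13.7, 0.5), "hue": (1.06, 0.1)},
    2: {"alcohol": (12.3, 0.5), "hue": (1.05, 0.2)},
}
priors = {1: 0.5, 2: 0.5}
sample = {"alcohol": 13.9, "hue": 1.0}

label, log_scores = classify(sample, stats, priors)
print(label, log_scores)
```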


In [6]:

determined_labels = []

# Begin the calculation
for row_index, row in test_data.iterrows():
    print("\n\n***************************")
    print(f"Calculating label for row {row_index + 1}...")
    print("***************************\n")

    cultivar_label_bayes_probabilities = []

    # Check each of the labels
    for label_idx in range(3):

        print(f"\n-------Calculating P(cultivar == {label_idx + 1})...\n")

        features_gauss = []

        for feature in feature_list:
            feature_x = row[feature]
            feature_mean = means[label_idx][feature]
            feature_stdev = stdevs[label_idx][feature]

            feature_gauss = calculate_normal_probability_distribution(mean=feature_mean,
                                                                      stdev=feature_stdev,
                                                                      x=feature_x)

            features_gauss.append(feature_gauss)

            print(f"-------Feature: {feature} -------")
            print(f"X = {feature_x}")
            print(f"Feature mean = {feature_mean}")
            print(f"Feature Std Dev = {feature_stdev}")
            print(f"P({feature}| Cultivar == {label_idx +1}) = {feature_gauss}")

        # Calculate the posterior of the label given all of the features, P(label | features)
        cultivar_posterior = cultivar_label_posteriors[label_idx]

        label_probability = calculate_gaussian_naive_bayes(features_gauss, cultivar_posterior)

        cultivar_label_bayes_probabilities.append(label_probability)

        print(f"\n >>>>>>>> P(cultivar == {label_idx + 1} | {','.join(feature_list)}): {label_probability}")

    # To get the final value, we take the index of the largest label probability and offset it by 1
    final_cultivar = cultivar_label_bayes_probabilities.index(max(cultivar_label_bayes_probabilities)) + 1
    determined_labels.append(final_cultivar)

    print(f"\n ================ RESULTS FOR ROW {row_index + 1} =================")
    print("Given the following Bayes posteriors for each label:")

    for result_idx, result in enumerate(cultivar_label_bayes_probabilities):
        print(f"P(cultivar == {result_idx + 1}): {result}")

    print(f"ROW {row_index + 1} is probably Cultivar {final_cultivar}")




***************************
Calculating label for row 1...
***************************


-------Calculating P(cultivar == 1)...

-------Feature: alcohol -------
X = 14.06
Feature mean = 13.75140350877193
Feature Std Dev = 0.4590729734883864
P(alcohol| Cultivar == 1) = 0.6932743534246821
-------Feature: malic_acid -------
X = 2.15
Feature mean = 2.012456140350877
Feature Std Dev = 0.6997481197385915
P(malic_acid| Cultivar == 1) = 0.5592148337601981
-------Feature: ash -------
X = 2.61
Feature mean = 2.4591228070175437
Feature Std Dev = 0.2252798315568122
P(ash| Cultivar == 1) = 1.4151043549810103
-------Feature: alcalinity_of_ash -------
X = 17.6
Feature mean = 17.0280701754386
Feature Std Dev = 2.590280899409292
P(alcalinity_of_ash| Cultivar == 1) = 0.15030625266898287
-------Feature: magnesium -------
X = 121.0
Feature mean = 106.0701754385965
Feature Std Dev = 10.499761305318883
P(magnesium| Cultivar == 1) = 0.013825801822313126
-------Feature: total_phenols -------
X = 2.6
Feature

## 7. Format the final DataFrame
We insert the `determined_labels` list as a column of the DataFrame for better formatting


In [7]:
# Adjust the DataFrame to the determined cultivar label
test_data.insert(loc=0, column='cultivar', value=determined_labels)

print("Final Test Dataset:")
print(test_data.to_string())


Final Test Dataset:
   cultivar  alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  od280/od315_of_diluted_wines  proline
0         1    14.06        2.15  2.61               17.6        121           2.60        2.51                  0.31             1.25             5.05  1.06                          3.58     1295
1         1    13.05        1.77  2.10               17.0        107           3.00        3.00                  0.28             2.03             5.04  0.88                          3.35      885
2         2    11.96        1.09  2.30               21.0        101           3.38        2.14                  0.13             1.65             3.21  0.99                          3.13      886
3         2    12.42        1.61  2.19               22.5        108           2.00        2.09                  0.34             1.61             2.06  1.06                          2.96     