# Module 6 - Critical Thinking - Option 1
## David Edwards
### CSC510 - Foundations of Artificial Intelligence
### Colorado State University-Global Campus
### Dr. Isaac K. Gang
### August 29, 2021



For this assignment, we will be creating a Naive Bayes classifier using Scikit-learn.  

First, we import the functions and classes we need.  We will use the wine dataset that ships with scikit-learn.

In [37]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
import numpy

In [24]:
wine_data = load_wine()

With the wine dataset loaded, we can look at the features it provides.  This is a classification dataset: the feature values of each wine are used to assign it to one of several categories.

In [25]:
wine_data.feature_names

['alcohol',
 'malic_acid',
 'ash',
 'alcalinity_of_ash',
 'magnesium',
 'total_phenols',
 'flavanoids',
 'nonflavanoid_phenols',
 'proanthocyanins',
 'color_intensity',
 'hue',
 'od280/od315_of_diluted_wines',
 'proline']

So, given values for each of these features, what are we looking to categorize these wines into?  Red, Rosé, and White?

In [26]:
wine_data.target_names

array(['class_0', 'class_1', 'class_2'], dtype='<U7')

Oh.  That's not very helpful.  According to [the UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/wine), these represent wines produced from three different cultivars of grape.  A cultivar is a plant variety produced by selective breeding.

Despite looking into the description of the dataset, I can't find the actual cultivar names. 

In [27]:
wine_data.data.shape

(178, 13)

We have 178 records, each with a value for the 13 features shown above.
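
It can also help to know how those 178 records are spread across the three classes. The snippet below is a small sketch using `numpy.bincount` on the target array; the variable name `class_counts` is my own and is not part of the dataset.

```python
import numpy
from sklearn.datasets import load_wine

wine_data = load_wine()

# Count how many records carry each class label (0, 1 or 2)
class_counts = numpy.bincount(wine_data.target)

for name, count in zip(wine_data.target_names, class_counts):
    print(name, count)
```

If the classes are noticeably uneven, that is worth keeping in mind when reading the accuracy score later on.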

In [28]:
wine_data.data[0:2]

array([[1.423e+01, 1.710e+00, 2.430e+00, 1.560e+01, 1.270e+02, 2.800e+00,
        3.060e+00, 2.800e-01, 2.290e+00, 5.640e+00, 1.040e+00, 3.920e+00,
        1.065e+03],
       [1.320e+01, 1.780e+00, 2.140e+00, 1.120e+01, 1.000e+02, 2.650e+00,
        2.760e+00, 2.600e-01, 1.280e+00, 4.380e+00, 1.050e+00, 3.400e+00,
        1.050e+03]])

Looking at the data shows the values for each of these features.  I only display the first two rows because the numbers don't mean much to me.
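
The raw array is easier to read when each value is shown next to its feature name. This is only a convenience sketch; `first_record` is a name introduced here for illustration.

```python
from sklearn.datasets import load_wine

wine_data = load_wine()

# Pair each feature name with its value in the first record
first_record = dict(zip(wine_data.feature_names, wine_data.data[0]))

for name, value in first_record.items():
    print(f"{name:30s} {value:10.2f}")
```

Printed this way, it is also easier to see that the features sit on very different scales (compare `proline` with `hue`, for example).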

We will divide the data into training and testing sets.  A test_size of 0.3 means that 30% of the data is held out for testing.

In [29]:
X_train, X_test, y_train, y_test = train_test_split(wine_data.data, wine_data.target, test_size=0.3, random_state=0)

This leaves the remaining 70% (124 of the 178 rows) for training.

In [30]:
X_train.shape

(124, 13)
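
The split above is purely random, so the class proportions in the two sets may differ slightly. As an optional variation (not what this notebook uses), `train_test_split` accepts a `stratify` argument that keeps the class proportions roughly equal in both sets. The `_s` suffixed names below are hypothetical, chosen so they don't clash with the variables used elsewhere, and results from this split would not match the numbers reported in this notebook.

```python
import numpy
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine_data = load_wine()

# Same 70/30 split, but stratified on the class labels
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    wine_data.data,
    wine_data.target,
    test_size=0.3,
    random_state=0,
    stratify=wine_data.target,
)

print("train class counts:", numpy.bincount(y_train_s))
print("test class counts: ", numpy.bincount(y_test_s))
```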

We will be using the Gaussian Naive Bayes model.

In [31]:
gnb = GaussianNB()

Now, we fit the model using our training data and targets.

In [32]:
gnb.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)
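
To get a feel for what fitting does, here is a rough from-scratch sketch of Gaussian Naive Bayes in NumPy. It estimates a prior, a mean and a variance per class and feature, then picks the class with the highest log prior plus summed log Gaussian likelihood. The names `priors`, `means`, `variances` and `log_joint` are mine, and the smoothing term is an assumption meant to mirror the `var_smoothing=1e-09` default shown above; treat this as an illustration rather than scikit-learn's actual implementation.

```python
import numpy
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

wine_data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine_data.data, wine_data.target, test_size=0.3, random_state=0)

classes = numpy.unique(y_train)

# Per-class prior, feature means and feature variances from the training data
priors = numpy.array([(y_train == c).mean() for c in classes])
means = numpy.array([X_train[y_train == c].mean(axis=0) for c in classes])
variances = numpy.array([X_train[y_train == c].var(axis=0) for c in classes])

# Assumption: a small constant added to every variance for numerical stability
variances += 1e-9 * X_train.var(axis=0).max()


def log_joint(X):
    """Return log P(class) + sum of per-feature log Gaussian densities.

    The result has shape (n_samples, n_classes).
    """
    scores = []
    for prior, mean, var in zip(priors, means, variances):
        log_likelihood = -0.5 * numpy.sum(
            numpy.log(2 * numpy.pi * var) + (X - mean) ** 2 / var, axis=1)
        scores.append(numpy.log(prior) + log_likelihood)
    return numpy.column_stack(scores)


# Predict the class with the highest score and compare with scikit-learn
manual_pred = classes[numpy.argmax(log_joint(X_test), axis=1)]
sklearn_pred = GaussianNB().fit(X_train, y_train).predict(X_test)
print("Agreement with GaussianNB:", (manual_pred == sklearn_pred).mean())
```

The "naive" part is the sum over features: each feature is treated as independent of the others once the class is known.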

Let's see how we did. First, we'll predict results from the test X values.

In [33]:
y_pred = gnb.predict(X_test)
y_pred

array([0, 2, 1, 0, 1, 1, 0, 2, 1, 1, 2, 2, 0, 0, 2, 1, 0, 0, 2, 0, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 2, 0, 0, 1, 0, 0, 0, 2, 1, 1, 2, 0, 0, 1, 1,
       1, 0, 2, 1, 2, 0, 2, 2, 0, 2])

And now we determine the accuracy.  About 94% (51 of the 54 test records) is pretty good!

In [34]:
metrics.accuracy_score(y_test, y_pred)

0.9444444444444444
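
A single accuracy number doesn't say which classes get confused with each other. One way to look closer, sketched below, is `metrics.confusion_matrix` together with `metrics.classification_report`. The setup lines repeat the earlier cells so the snippet runs on its own; `cm` is a name introduced here.

```python
from sklearn import metrics
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

wine_data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine_data.data, wine_data.target, test_size=0.3, random_state=0)
y_pred = GaussianNB().fit(X_train, y_train).predict(X_test)

# Rows are the true classes, columns are the predicted classes
cm = metrics.confusion_matrix(y_test, y_pred)
print(cm)

# Per-class precision, recall and F1 score
print(metrics.classification_report(
    y_test, y_pred, target_names=list(wine_data.target_names)))
```

The diagonal of the matrix holds the correct predictions, so its sum divided by the total number of test records is the accuracy score shown above.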

We can also get the predicted probability of each class using the predict_proba function.  As you can see, the model is generally very confident in its predictions; however, the third row assigns a small probability (about 1.4%) to a different class.  I only show the first 10 rows for convenience.

In [48]:
y_prob = gnb.predict_proba(X_test)
numpy.round(y_prob,4)[0:10]

array([[1.    , 0.    , 0.    ],
       [0.    , 0.    , 1.    ],
       [0.0136, 0.9864, 0.    ],
       [1.    , 0.    , 0.    ],
       [0.    , 1.    , 0.    ],
       [0.    , 1.    , 0.    ],
       [1.    , 0.    , 0.    ],
       [0.    , 0.    , 1.    ],
       [0.    , 1.    , 0.    ],
       [0.    , 1.    , 0.    ]])
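
Two properties of these probabilities are easy to check: each row sums to 1, and the predicted class is the column with the largest value. The sketch below also lists test rows where the model is less than 99% sure; that threshold and the names `confidence` and `uncertain` are arbitrary choices of mine.

```python
import numpy
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

wine_data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine_data.data, wine_data.target, test_size=0.3, random_state=0)
gnb = GaussianNB().fit(X_train, y_train)

y_prob = gnb.predict_proba(X_test)

# Each row is a probability distribution over the three classes
print("rows sum to 1:", numpy.allclose(y_prob.sum(axis=1), 1.0))

# The predicted label is the class with the highest probability
print("argmax matches predict:",
      (gnb.classes_[y_prob.argmax(axis=1)] == gnb.predict(X_test)).all())

# Rows where the top probability falls below an (arbitrary) 0.99 threshold
confidence = y_prob.max(axis=1)
uncertain = numpy.flatnonzero(confidence < 0.99)
print("less confident rows:", uncertain)
```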

We can also see how many values we incorrectly predicted:

In [49]:
print("Number of mislabeled points out of a total %d points : %d" % (X_test.shape[0], (y_test != y_pred).sum()))

Number of mislabeled points out of a total 54 points : 3
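
To see which records were mislabeled rather than just how many, we can collect their positions in the test set. This is a small sketch; `wrong` is a name introduced here, and the setup lines repeat the earlier cells so the snippet is self-contained.

```python
import numpy
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

wine_data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine_data.data, wine_data.target, test_size=0.3, random_state=0)
gnb = GaussianNB().fit(X_train, y_train)
y_pred = gnb.predict(X_test)
y_prob = gnb.predict_proba(X_test)

# Positions in the test set where the prediction differs from the true label
wrong = numpy.flatnonzero(y_test != y_pred)

for i in wrong:
    print("row", i,
          "| actual:", wine_data.target_names[y_test[i]],
          "| predicted:", wine_data.target_names[y_pred[i]],
          "| probabilities:", numpy.round(y_prob[i], 4))
```

Looking at the probabilities for these rows shows whether the model was unsure about them or confidently wrong.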
