## Machine Learning - Wine Dataset
- 13 features
- Classifies wine into 3 classes
- 178 samples
- No Data Wrangling (Clean Data)
- Feature Scaling using Standardization
- Classification using Random Forest

In [59]:
# Load libraries for Machine Learning
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

### Load the scikit-learn built-in wine dataset example

In [60]:
data = load_wine()

### Note: the scikit-learn example dataset is not a NumPy array or a pandas DataFrame; it is a Bunch object.

In [61]:
type(data)

sklearn.utils.Bunch
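
A `Bunch` behaves like a dictionary whose keys are also available as attributes. As a minimal sketch (assuming a scikit-learn version that ships `load_wine`), the following lists what the object holds, including the feature and class names:

```python
from sklearn.datasets import load_wine

data = load_wine()

# A Bunch is dictionary-like, so its contents can be listed by key
print(sorted(data.keys()))

# Names of the 13 features and of the 3 classes
print(data.feature_names)
print(data.target_names)

# Shapes of the feature matrix and the label vector
print(data.data.shape, data.target.shape)
```

Both access styles work: `data['data']` and `data.data` refer to the same array.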

### The features are accessed through the attribute 'data' and the labels (i.e., the classes) through the attribute 'target'

In [62]:
X = data.data
y = data.target

In [63]:
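# The features and labels are both numpy arrays.
# Only the last expression in a cell is displayed, so check both types in one expression.
type(X), type(y)

(numpy.ndarray, numpy.ndarray)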

In [64]:
# Verify that X (features) and y (labels) each have 178 samples, and that X has 13 features per sample
X.shape

(178, 13)

In [65]:
y.shape

(178,)
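
Beyond the shapes, it can help to check how the 178 samples are distributed over the 3 classes. A small sketch using `np.bincount`, which counts occurrences of each non-negative integer label:

```python
import numpy as np
from sklearn.datasets import load_wine

data = load_wine()
X, y = data.data, data.target

# Count how many samples belong to each class label (0, 1, 2)
class_counts = np.bincount(y)
for name, count in zip(data.target_names, class_counts):
    print(name, count)
```

The classes are not perfectly balanced, which is worth keeping in mind when reading the per-class results later on.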

### Split the data into training and test sets
- 70% training, 30% test
- random selection (non-sequential)

In [66]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.30,
                                                    random_state=101)
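
As a quick check, the sketch below repeats the same split and prints the resulting sizes. It also shows an optional variant that is not part of this notebook: passing `stratify=y` asks `train_test_split` to keep the class proportions roughly equal in both sets. The `_s` variable names are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Same split as in the notebook: 70% training, 30% test, shuffled
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101)
print(X_train.shape, X_test.shape)

# Optional variant (an assumption, not used in this notebook):
# stratify on the labels to preserve class proportions
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.30, random_state=101, stratify=y)
print(np.bincount(y_test), np.bincount(y_test_s))
```

With 178 samples, a 30% test size yields 54 test samples and 124 training samples.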

### Scale the Data using Standardization

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

In [67]:
# View the data before scaling
X_train

array([[  1.18700000e+01,   4.31000000e+00,   2.39000000e+00, ...,
          7.50000000e-01,   3.64000000e+00,   3.80000000e+02],
       [  1.21700000e+01,   1.45000000e+00,   2.53000000e+00, ...,
          1.45000000e+00,   2.23000000e+00,   3.55000000e+02],
       [  1.23400000e+01,   2.45000000e+00,   2.46000000e+00, ...,
          8.00000000e-01,   3.38000000e+00,   4.38000000e+02],
       ..., 
       [  1.27200000e+01,   1.81000000e+00,   2.20000000e+00, ...,
          1.16000000e+00,   3.14000000e+00,   7.14000000e+02],
       [  1.41200000e+01,   1.48000000e+00,   2.32000000e+00, ...,
          1.17000000e+00,   2.82000000e+00,   1.28000000e+03],
       [  1.24700000e+01,   1.52000000e+00,   2.20000000e+00, ...,
          1.16000000e+00,   2.63000000e+00,   9.37000000e+02]])

In [68]:
# Create an instance of the Scaler Class
scaler = StandardScaler()

In [69]:
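# Note: fit computes the mean and standard deviation of each feature, and transform uses them to scale the data.
# fit_transform does both in one step on X_train. The scaler is then already fitted, so X_test only needs transform.
# This way the test data is scaled with the training statistics and does not influence the fit.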
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

In [70]:
# View the data after it has been scaled
X_train

array([[-1.4421205 ,  1.8516745 ,  0.17426739, ..., -0.87833289,
         1.37995589, -1.15691894],
       [-1.07414682, -0.73489335,  0.70876951, ...,  2.11191244,
        -0.5311054 , -1.23613021],
       [-0.8656284 ,  0.16950101,  0.44151845, ..., -0.66474394,
         1.02756161, -0.97314879],
       ..., 
       [-0.39952841, -0.40931138, -0.55112833, ...,  0.87309652,
         0.70227458, -0.09865636],
       [ 1.31768209, -0.70776152, -0.09298366, ...,  0.91581431,
         0.26855854,  1.69468681],
       [-0.70617314, -0.67158574, -0.55112833, ...,  0.87309652,
         0.01103965,  0.60790817]])
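
To make the scaling step concrete, here is a minimal sketch that reproduces the transform by hand from the fitted `mean_` and `scale_` attributes of `StandardScaler`, i.e. `(x - mean) / std` per feature. The `_scaled` and `manual` names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std, then scale
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std

# Standardization by hand, using the statistics learned from X_train
manual = (X_train - scaler.mean_) / scaler.scale_
print(np.allclose(manual, X_train_scaled))

# Each training feature now has mean ~0 and standard deviation ~1
print(np.allclose(X_train_scaled.mean(axis=0), 0.0))
print(np.allclose(X_train_scaled.std(axis=0), 1.0))
```

The scaled test set will generally not have exactly zero mean and unit variance, because it is scaled with the training statistics rather than its own.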

### Let's train a model on the training data using an ensemble method: Random Forest
- n_estimators: the number of trees in the forest
- criterion: the function used to measure the quality of a split ('gini' or 'entropy')

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [71]:
# Train a model (using 5 trees)
model = RandomForestClassifier( n_estimators = 5, criterion = 'entropy', random_state = 101 )
model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=5, n_jobs=1,
            oob_score=False, random_state=101, verbose=0, warm_start=False)
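
Once fitted, the forest can be inspected. The sketch below is an illustrative extra rather than a step of this notebook: it confirms that the ensemble holds 5 trees and ranks the features by the model's `feature_importances_` attribute. The exact ranking may differ between scikit-learn versions.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.30, random_state=101)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = RandomForestClassifier(n_estimators=5, criterion='entropy',
                               random_state=101)
model.fit(X_train, y_train)

# The fitted ensemble holds one decision tree per estimator
print(len(model.estimators_))

# Importances are normalized to sum to 1; list them from high to low
importances = model.feature_importances_
ranked = sorted(zip(data.feature_names, importances),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```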

### Let's use the model to make predictions with the test data
- Run the test data through the model to predict the label (classification)
- Compare the predicted values to the actual values
- Determine accuracy of our model from the test data

In [72]:
y_pred = model.predict(X_test)

In [73]:
# View the predicted values
y_pred

array([0, 0, 2, 0, 2, 1, 2, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 2, 1, 1, 1, 2, 1,
       2, 0, 0, 1, 1, 2, 1, 2, 1, 1, 0, 0, 0, 0, 1, 1, 2, 1, 0, 0, 1, 2, 1,
       1, 2, 2, 1, 0, 1, 1, 0])

In [74]:
# View the actual values
y_test

array([0, 0, 2, 0, 2, 1, 2, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 2, 1, 1, 1, 2, 2,
       2, 0, 0, 1, 1, 2, 1, 2, 1, 1, 0, 0, 0, 0, 1, 1, 2, 1, 0, 0, 1, 2, 1,
       1, 2, 2, 1, 0, 1, 1, 0])

### Determine the accuracy of our predictions with the test data
- Display a matrix of shape n_classes x n_classes
- Rows are the actual classes and columns are the predicted classes.
- The number of correct predictions per class will be along the diagonal.

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

In [75]:
# Create a confusion matrix with the actual and predicted values
cm = confusion_matrix(y_test, y_pred)

In [78]:
# View results (52 correct, 2 incorrect)
cm

array([[18,  1,  0],
       [ 0, 22,  0],
       [ 0,  1, 12]], dtype=int64)

In [79]:
# Calculate our accuracy (correct predictions / total test samples)
accuracy = 52 / 54
accuracy

0.9629629629629629
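
Rather than typing the counts by hand, the accuracy can be derived from the confusion matrix itself: the diagonal sum (the trace) is the number of correct predictions and the sum of all entries is the number of test samples. A minimal sketch, using the matrix shown above as a literal array so that it runs on its own:

```python
import numpy as np

# Confusion matrix from the cell above (rows: actual, columns: predicted)
cm = np.array([[18,  1,  0],
               [ 0, 22,  0],
               [ 0,  1, 12]])

# Correct predictions lie on the diagonal
correct = np.trace(cm)
total = cm.sum()
accuracy = correct / total
print(correct, total, accuracy)

# Per-class recall: correct predictions divided by actual samples per class
recall_per_class = np.diag(cm) / cm.sum(axis=1)
print(recall_per_class)
```

scikit-learn also provides `sklearn.metrics.accuracy_score(y_test, y_pred)`, which computes the same overall figure directly from the label arrays.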