In [4]:
using CSV, DataFrames, PyPlot, ScikitLearn, Random

# support vector classifier
@sk_import svm : SVC
# K-folds cross validation
using ScikitLearn.CrossValidation: KFold



## read in the data
The source of the data is [here](https://archive.ics.uci.edu/ml/datasets/Wine).

Each row of `wine.csv` represents measurements on a different bottle of wine, which is one of two varieties. The three columns are:
* `class`: the label, i.e. what variety/class of wine it is. the names of the varieties are not explicitly given, but think: Pinot Noir (-1) vs. Syrah (1).
* `alcohol`: the first feature, percent alcohol in the wine
* `malic_acid`: the second feature, malic acid concentration in the wine

In [2]:
df = CSV.read("wine.csv", copycols=true)
first(df, 5)

| | class | alcohol | malic_acid |
|---|---|---|---|
| | Int64 | Float64 | Float64 |
| 1 | -1 | 12.37 | 0.94 |
| 2 | -1 | 12.33 | 1.1 |
| 3 | -1 | 12.64 | 1.36 |
| 4 | -1 | 13.67 | 1.25 |
| 5 | -1 | 12.37 | 1.13 |


how many wines are in each class?
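
One possible way to count the wines in each class is to tally the labels in a dictionary. The sketch below is only an illustration: `class_counts` is a hypothetical helper name, and the labels are made up rather than read from the data file.

```julia
# count how many times each class label occurs.
# `class_counts` is a hypothetical helper name used only for this sketch.
function class_counts(labels)
    counts = Dict{eltype(labels), Int}()
    for label in labels
        counts[label] = get(counts, label, 0) + 1
    end
    return counts
end

# made-up labels for illustration (NOT taken from the data file)
toy_labels = [-1, -1, 1, -1, 1]
counts = class_counts(toy_labels)
println(counts)
```

In the notebook, the same helper could presumably be applied to the `class` column of `df`, e.g. `class_counts(df.class)`.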

## visualize the data

draw a scatter plot of the data in 2D feature space. color each data point by its class label. use hollow circles so that overlapping points remain visible.
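
A minimal sketch of such a plot with PyPlot, assuming the two features and the labels are available as plain vectors. The data values, the `class_color` dictionary, and the color choices are made up for illustration. Passing `facecolors="none"` to `scatter` is the matplotlib way to get hollow markers.

```julia
using PyPlot

# toy feature values and labels, made up for illustration (NOT from the data file)
alcohol    = [12.4, 12.3, 12.6, 13.9, 14.1, 13.8]
malic_acid = [0.9, 1.1, 1.4, 1.7, 1.9, 2.1]
labels     = [-1, -1, -1, 1, 1, 1]

# one color per class label (an arbitrary choice for this sketch)
class_color = Dict(-1 => "C0", 1 => "C1")

figure()
for c in sort(unique(labels))
    in_class = labels .== c
    # facecolors="none" leaves the marker interior empty (hollow circles)
    scatter(alcohol[in_class], malic_acid[in_class],
            facecolors="none", edgecolors=class_color[c], label="class $c")
end
xlabel("alcohol")
ylabel("malic acid")
legend()
```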

## getting data ready for input to scikitlearn

to build a predictive model in scikitlearn:
* construct a feature matrix `X` that has `n_wines` rows and `2` columns (one column for each feature)
* construct a column vector `y` with the labels

loop through the rows of the wine `DataFrame` and populate each entry of the feature matrix `X` and target vector `y` with appropriate values
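
The loop described above might look like the following sketch. `build_X_y` is a hypothetical helper name, and `toy_df` holds only the five rows printed by `first(df, 5)` earlier, so that the example runs without the data file.

```julia
using DataFrames

# fill the feature matrix X (one row per wine, one column per feature)
# and the label vector y by looping over the rows of the DataFrame.
# `build_X_y` is a hypothetical helper name used only for this sketch.
function build_X_y(df)
    n_wines = nrow(df)
    X = zeros(n_wines, 2)
    y = zeros(Int, n_wines)
    for (i, row) in enumerate(eachrow(df))
        X[i, 1] = row[:alcohol]
        X[i, 2] = row[:malic_acid]
        y[i] = row[:class]
    end
    return X, y
end

# the five rows shown in the notebook output, standing in for the full data set
toy_df = DataFrame(class=[-1, -1, -1, -1, -1],
                   alcohol=[12.37, 12.33, 12.64, 13.67, 12.37],
                   malic_acid=[0.94, 1.1, 1.36, 1.25, 1.13])
X, y = build_X_y(toy_df)
```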

## training a support vector machine (SVM)

train a support vector machine to classify wines using *all* of the data. use `C=1.0` and the linear kernel. then evaluate the accuracy on the training data. we'll later show through cross-validation that this is an overestimate of the true accuracy of the SVM classifier on unseen data. [here](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) is the documentation for `SVC` (Support Vector Classifier) in scikitlearn.

```julia
# use a linear kernel
clf = SVC(kernel="linear", C=1.0)
```
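
A sketch of how the fit and the training accuracy could be computed with the `fit!` and `predict` functions of ScikitLearn.jl. The feature matrix and labels here are made-up toy values, not the wine data. The accuracy is the fraction of training points whose predicted label matches the true label.

```julia
using ScikitLearn
@sk_import svm : SVC

# toy data, made up for illustration (NOT from the data file)
# columns of X: alcohol, malic acid
X = [12.4 0.9; 12.3 1.1; 12.6 1.4; 13.9 1.7; 14.1 1.9; 13.8 2.1]
y = [-1, -1, -1, 1, 1, 1]

clf = SVC(kernel="linear", C=1.0)
fit!(clf, X, y)

# predict on the same data used for training
y_pred = predict(clf, X)

# fraction of training points classified correctly
train_accuracy = sum(y_pred .== y) / length(y)
println("accuracy on training data: ", train_accuracy)
```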

## visualize the decision boundary

draw the decision boundary (in feature space) learned by the SVM trained above on all of the data. also plot the data in feature space (with the decision boundary) with different colors/symbols for the different classes (exactly as in the `visualize the data` section). hint: follow the class notes for k-nearest neighbors, using `contourf`, but this is not the only way.
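
One way to follow the `contourf` hint is to predict the class at every point of a regular grid in feature space and shade the two predicted regions. The sketch below assumes that approach. The data, the grid margin of 0.5, the grid resolution, and the colors are arbitrary choices for illustration.

```julia
using PyPlot, ScikitLearn
@sk_import svm : SVC

# toy data, made up for illustration (NOT from the data file)
# columns of X: alcohol, malic acid
X = [12.4 0.9; 12.3 1.1; 12.6 1.4; 13.9 1.7; 14.1 1.9; 13.8 2.1]
y = [-1, -1, -1, 1, 1, 1]

clf = SVC(kernel="linear", C=1.0)
fit!(clf, X, y)

# grid covering the data, with a small margin on each side
alcohol_grid = collect(range(minimum(X[:, 1]) - 0.5, stop=maximum(X[:, 1]) + 0.5, length=100))
malic_grid   = collect(range(minimum(X[:, 2]) - 0.5, stop=maximum(X[:, 2]) + 0.5, length=100))

# feature values at every grid point; rows follow malic acid, columns follow alcohol
alcohol_mesh = [a for m in malic_grid, a in alcohol_grid]
malic_mesh   = [m for m in malic_grid, a in alcohol_grid]
grid_points  = hcat(vec(alcohol_mesh), vec(malic_mesh))

# predicted class at each grid point, reshaped back onto the grid
Z = reshape(predict(clf, grid_points), length(malic_grid), length(alcohol_grid))

figure()
# the boundary between the two shaded regions is the decision boundary
contourf(alcohol_grid, malic_grid, Z, levels=[-1.5, 0.0, 1.5], alpha=0.2)
class_color = Dict(-1 => "C0", 1 => "C1")
for c in [-1, 1]
    scatter(X[y .== c, 1], X[y .== c, 2],
            facecolors="none", edgecolors=class_color[c], label="class $c")
end
xlabel("alcohol")
ylabel("malic acid")
legend()
```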

## $K=5$-fold cross validation
use $K=5$-fold cross validation to:
* choose the optimal `C` parameter in the SVM classifier
* assess the accuracy of the model on unseen data

plot the average test set accuracy (average over the $K$ folds) against the `C` parameter used.

report the best `C` parameter and the associated average test set accuracy (`argmax` might be useful). this test set accuracy is a quality metric of how well the SVM will perform on new, unseen data that is not in the training set. explore the following set of `C` parameters: `c_params = 10.0 .^ range(-3, stop=0, length=25)`.

In [3]:
K = 5 # number of folds

c_params = 10.0 .^ range(-3, stop=0, length=25)


25-element Array{Float64,1}:
 0.001                
 0.001333521432163324 
 0.0017782794100389228
 0.0023713737056616554
 0.0031622776601683794
 0.004216965034285823 
 0.005623413251903491 
 0.007498942093324558 
 0.01                 
 0.01333521432163324  
 0.01778279410038923  
 0.023713737056616554 
 0.03162277660168379  
 0.042169650342858224 
 0.05623413251903491  
 0.07498942093324558  
 0.1                  
 0.1333521432163324   
 0.1778279410038923   
 0.23713737056616552  
 0.31622776601683794  
 0.4216965034285822   
 0.5623413251903491   
 0.7498942093324559   
 1.0                  
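
A sketch of the cross-validation loop, under two assumptions that should be checked against the installed ScikitLearn.jl version: that `KFold(n, n_folds=K, shuffle=true)` is the right call signature, and that iterating over its result yields `(train_ids, test_ids)` pairs of row indices. `cv_accuracy` is a hypothetical helper name, and the data are made-up toy values rather than the wine data.

```julia
using PyPlot, ScikitLearn
@sk_import svm : SVC
using ScikitLearn.CrossValidation: KFold

# average test set accuracy over K folds, for each C in c_params.
# `cv_accuracy` is a hypothetical helper name used only for this sketch.
function cv_accuracy(X, y, c_params, K)
    # assumption: this call yields (train_ids, test_ids) index pairs
    folds = KFold(length(y), n_folds=K, shuffle=true)
    total_accuracy = zeros(length(c_params))
    for (i, C) in enumerate(c_params)
        for (train_ids, test_ids) in folds
            clf = SVC(kernel="linear", C=C)
            fit!(clf, X[train_ids, :], y[train_ids])
            y_pred = predict(clf, X[test_ids, :])
            total_accuracy[i] += sum(y_pred .== y[test_ids]) / length(test_ids)
        end
    end
    return total_accuracy ./ K
end

# toy data, made up for illustration (NOT from the data file)
X = [12.4 0.9; 12.3 1.1; 12.6 1.4; 12.5 1.0; 12.2 1.3;
     13.9 1.7; 14.1 1.9; 13.8 2.1; 14.0 1.8; 13.7 2.0]
y = [-1, -1, -1, -1, -1, 1, 1, 1, 1, 1]

K = 5
c_params = 10.0 .^ range(-3, stop=0, length=25)
avg_accuracy = cv_accuracy(X, y, c_params, K)

# argmax returns the index of the first maximum if several C values tie
best = argmax(avg_accuracy)
println("best C = ", c_params[best], ", average test set accuracy = ", avg_accuracy[best])

figure()
plot(c_params, avg_accuracy, marker="o")
xscale("log")
xlabel("C")
ylabel("average test set accuracy")
```

The same folds are reused for every value of `C`, so that differences in accuracy reflect the parameter and not a different split of the data.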

my conclusion: 