# Support Vector Machines: practical session

In this notebook you will find some data challenges to experiment with SVMs.

### Summary
 - SVM for Binary Classification
 - Evaluation of the Model
 - SVM for multiclass
     - example "by hand": implement one-vs-all for each class
     - exercise with automatic multiclass

In [28]:
# === imports
import pandas as pd
import numpy as np
import sklearn

%load_ext autoreload
%autoreload 2

### Binary Classification

The natural application of SVM methods is that of binary classification.

#### Americans and Atheism  
Let us find out how Americans tolerated atheists in 1976.
The dataset is described as follows:

```
Dataset:  atheist.dat

Source: E. Filsinger (1976). "Tolerance of Non-Believers: A Cross-Tabular
 and Log Linear Analysis of Some Religious Correlates," Review of Religious
 Research, Vol.17, #3, pp.232-240

Description: Church Attendance and Tolerance for Atheists for survey
of 1221 people.

Variables/Columns 
Church Attendance   8  /* 1=Never, 2=Yearly, 3=Monthly, 4=Weekly */
Tolerance for Atheists  16 /* 1=Low, 2=High  */
```

### Exercise 1: retrieve and understand data

The file is available online at: http://users.stat.ufl.edu/~winner/datasets.html  

 - Find the corresponding file
 - Save it in the 'data' directory
 - Import it as a pandas dataframe. No cheating, it can be done with a pandas function.
 
Your dataframe should have 1221 rows and 2 columns, which you'll name respectively 'attendance', 'tolerance'.

In [29]:
# === Solution
data_file = 'data/atheist.dat'
df = pd.read_csv(data_file, sep=r'\s+', header=None, names=['attendance', 'tolerance'])  # raw string: '\s' is a regex, not a Python escape
print(df.shape)       # --> (nrow, ncols)
print(df.iloc[0:5,:]) # print first lines of dataset

(1221, 2)
   attendance  tolerance
0           1          2
1           1          2
2           1          2
3           1          2
4           1          2


### Exercise 2: run SVM. Does it make sense?

In 1976, the inventors of SVMs were probably still learning to ride their bicycles.  
Let us see whether it makes any sense to use them to predict the tolerance of US citizens towards atheists.

First of all, what can we learn from a close look at the data and its description?  
Do we have categorical, binary, continuous data?

These are fundamental questions, which you'll ask yourselves for the rest of your (data science) life.
Often, colleagues or clients have no idea about these formalities; they have a problem and they ask you to provide a good enough solution.

*Answer* : it looks like we could consider them as continuous. However, they were recorded on a step scale. [...]
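One way to take that close look is to count the distinct values in each column and cross-tabulate them. The snippet below is only a sketch: `toy` is a hypothetical, made-up frame that mimics the column names and coding of the survey, not the real counts. On the actual data you would run the same calls on `df`.

```python
import pandas as pd

# Hypothetical toy frame with the same columns and coding as the survey.
# The rows are made up for illustration; they are NOT the real counts.
toy = pd.DataFrame({
    'attendance': [1, 1, 2, 3, 4, 4, 4, 2],
    'tolerance':  [2, 2, 2, 1, 1, 1, 2, 1],
})

# Distinct values per column: a handful of integer codes suggests
# coded categories rather than a continuous measurement.
n_levels = toy.nunique()
print(n_levels)

# Contingency table: rows are attendance levels, columns are tolerance levels.
table = pd.crosstab(toy['attendance'], toy['tolerance'])
print(table)
```

If a column only ever takes a few integer codes, treating it as a continuous quantity is a modeling choice rather than a property of the data.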

Let the fun begin. We want to do SVM classification with scikit-learn, but we know nothing about its API!  
Thou shalt fear no more, the word of the Doc is with us:
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.svm

We are doing SVM classification, could it be: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC ?
Let us copy-paste the example at the bottom of the page. Run the following:

In [33]:
from sklearn.svm import SVC
clf = SVC(gamma='auto')
clf.fit(X, y) 

NameError: name 'X' is not defined

Of course, "'X' is not defined".  
Let us take a moment to ponder how to name things.
Sure, "x" and "y" carry a strong meaning for anyone with a scientific background.
However, words carry meaning too; often, when we do some ML modeling, we are having a data-dialogue with someone, ourselves included.
Let us do our future selves a favor and give the variables meaningful names.

In [43]:
clf.fit(X=df['attendance'], y=df['tolerance']) 

ValueError: Expected 2D array, got 1D array instead:
array=[1. 1. 1. ... 4. 4. 4.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Something just went horribly wrong.

In [53]:
toler = np.reshape(df['tolerance'], -1)
attend = np.reshape(df['attendance'], -1)

clf.fit(X=attend, y=toler) 

ValueError: Expected 2D array, got 1D array instead:
array=[1. 1. 1. ... 4. 4. 4.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
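The error message itself hints at what is going on: scikit-learn expects `X` with shape `(n_samples, n_features)`, while `np.reshape(..., -1)` flattens to 1D, which is what we already had. The following is a minimal sketch of the shape fix, run on a hypothetical toy frame (`toy`, made-up values) instead of the survey data, so it says nothing about how well the model does on the real problem.

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVC

# Hypothetical toy frame (made-up values) with the same columns as the survey.
toy = pd.DataFrame({
    'attendance': [1, 1, 2, 2, 3, 3, 4, 4],
    'tolerance':  [2, 2, 2, 2, 1, 1, 1, 1],
})

# reshape(-1) yields a flat 1D array: same shape problem as before.
flat = np.reshape(toy['attendance'].to_numpy(), -1)
print(flat.shape)    # (8,)

# Double brackets select a one-column DataFrame, i.e. a 2D feature matrix.
# Equivalent: toy['attendance'].to_numpy().reshape(-1, 1)
attend = toy[['attendance']].to_numpy()
toler = toy['tolerance'].to_numpy()    # labels stay 1D
print(attend.shape)  # (8, 1)

clf = SVC(gamma='auto')
clf.fit(X=attend, y=toler)
pred = clf.predict(attend)
print(pred)
```

Note that only the feature matrix needs to be 2D; the labels `y` are expected as a 1D array.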

In [31]:
# TODO: things will (?) fail miserably. Let us see why

### Lesson learned: some data are more svm-able than others  
Indeed, 