# Classification-Based Machine Learning Algorithm

[An introduction to machine learning with scikit-learn](http://scikit-learn.org/stable/tutorial/basic/tutorial.html#introduction)

## Scikit-learn Definition:

**Supervised learning**, in which the data comes with additional attributes that we want to predict. This problem can be either:

* **Classification**: samples belong to two or more *classes* and we want to learn from already labeled data how to predict the class of unlabeled data. An example of a classification problem is handwritten digit recognition, in which the aim is to assign each input vector to one of a finite number of discrete categories. Another way to think of classification is as a discrete (as opposed to continuous) form of supervised learning: there is a limited number of categories, and for each of the n samples provided, the task is to label it with the correct category or class.


* **Regression**: if the desired output consists of one or more *continuous variables*, then the task is called regression. An example of a regression problem would be the prediction of the length of a salmon as a function of its age and weight.
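To make the distinction concrete, here is a minimal sketch on toy data. The arrays `X_toy`, `y_class`, and `y_cont` are invented for illustration and are not part of the tutorial; the point is only that a classifier returns discrete labels while a regressor returns continuous values.

```python
import numpy as np
from sklearn.svm import SVC, SVR

# Toy 1-D feature, invented purely for illustration
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])

# Classification: the targets are discrete class labels
y_class = np.array([0, 0, 0, 1, 1, 1])
classifier = SVC(kernel="linear").fit(X_toy, y_class)
class_pred = classifier.predict([[0.5], [4.5]])

# Regression: the targets are continuous values
y_cont = np.array([0.1, 1.9, 4.2, 5.8, 8.1, 9.9])
regressor = SVR(kernel="linear").fit(X_toy, y_cont)
reg_pred = regressor.predict([[0.5], [4.5]])

print(class_pred)  # integer class labels drawn from {0, 1}
print(reg_pred)    # floating-point estimates
```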

In [49]:
# import the libraries we need
import pandas as pd
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
%matplotlib inline

## Load the iris dataset

In [50]:
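# read the iris data; the path is relative to the notebook's working directory
# (alternative: import os, call os.chdir("../dataset"), then read "iris.data")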
df = pd.read_csv("../dataset/iris.data")

In [51]:
# show the first five rows
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


## Wrangling

In [52]:
# list the unique species names
df.species.unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

In [53]:
# select the features into X
col = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X = df.loc[:, col]

In [54]:
# map each species name to an integer code and store the result as the target y
species_to_num = {'Iris-setosa': 0,
                  'Iris-versicolor': 1,
                  'Iris-virginica': 2}
df['tmp'] = df['species'].map(species_to_num)
y = df['tmp']
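As an alternative to a hand-written dictionary, scikit-learn's `LabelEncoder` can derive the integer codes automatically. The sketch below uses a small stand-in `species` series (not the full dataset) built from the three names shown above. `LabelEncoder` assigns codes in sorted order of the class names, which for these three names happens to match the manual mapping.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small stand-in for df['species'], using the names from the output above
species = pd.Series(['Iris-setosa', 'Iris-versicolor',
                     'Iris-virginica', 'Iris-setosa'])

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(species)  # codes follow sorted class names

print(list(encoder.classes_))
print(y_encoded)

# inverse_transform recovers the original names from the integer codes
print(encoder.inverse_transform([2]))
```

One practical difference: `Series.map` with a dictionary silently yields `NaN` for any name missing from the dictionary, so checking `df['tmp'].isna().any()` after mapping is a cheap safeguard.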

## Inspecting the Features and Target

In [55]:
X

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
5,5.4,3.9,1.7,0.4
6,4.6,3.4,1.4,0.3
7,5.0,3.4,1.5,0.2
8,4.4,2.9,1.4,0.2
9,4.9,3.1,1.5,0.1


In [56]:
y

0      0
1      0
2      0
3      0
4      0
5      0
6      0
7      0
8      0
9      0
10     0
11     0
12     0
13     0
14     0
15     0
16     0
17     0
18     0
19     0
20     0
21     0
22     0
23     0
24     0
25     0
26     0
27     0
28     0
29     0
      ..
120    2
121    2
122    2
123    2
124    2
125    2
126    2
127    2
128    2
129    2
130    2
131    2
132    2
133    2
134    2
135    2
136    2
137    2
138    2
139    2
140    2
141    2
142    2
143    2
144    2
145    2
146    2
147    2
148    2
149    2
Name: tmp, Length: 150, dtype: int64

## Splitting the train and test sets

In [57]:
# split X and y into a training set (90%) and a test set (10%)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.9, random_state=0)
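With only 15 test samples, a purely random split can leave the classes unevenly represented. The sketch below shows the optional `stratify` argument of `train_test_split`, which keeps class proportions equal in both parts. It loads the data through `sklearn.datasets.load_iris` as a stand-in for the CSV file, and the names `X_demo`, `y_demo`, `X_tr`, and so on are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Built-in copy of the iris data, used here in place of the CSV file
X_demo, y_demo = load_iris(return_X_y=True)

# stratify=y_demo keeps the three classes balanced in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.9, random_state=0, stratify=y_demo)

print(X_tr.shape, X_te.shape)   # sizes of the training and test sets
print(np.bincount(y_tr))        # samples per class in the training set
print(np.bincount(y_te))        # samples per class in the test set
```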



## Support Vector Machine (SVM) Training

In [58]:
# create a StandardScaler instance
sc_x = StandardScaler()

# fit the scaler on the training data (learn mean and std), then transform it
X_std_train = sc_x.fit_transform(X_train)
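Any data scored later (such as `X_test`) should be scaled with the statistics learned from the training set, which means calling `transform` rather than `fit_transform` on it. A minimal sketch with made-up numbers (the `train` and `test` arrays are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test rows, for illustration only
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

scaler = StandardScaler()
train_std = scaler.fit_transform(train)  # learns mean and std from train
test_std = scaler.transform(test)        # reuses the training mean and std

print(train_std.mean(axis=0))  # close to zero in every column
print(test_std)                # test row expressed in training-set units
```

In the notebook's terms, the equivalent call would be `sc_x.transform(X_test)`.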

In [59]:
# train a linear-kernel SVM classifier on the standardized training data
clf = svm.SVC(kernel='linear', C=1.0, verbose=True)
clf.fit(X_std_train, y_train)

[LibSVM]

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=True)

## Measuring Accuracy

In [60]:
# Use cross_val_predict to get an out-of-fold prediction for every training sample;
# cv=3 requests 3-fold cross-validation
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(clf, X_std_train, y_train, cv=3)
# y_train_pred = clf.predict(X_std_train)
# print(y_train_pred)

[LibSVM][LibSVM][LibSVM]

In [63]:
# confusion matrix: rows are true classes, columns are cross-validated predictions
confusion_matrix(y_train, y_train_pred)

array([[47,  0,  0],
       [ 0, 38,  4],
       [ 0,  2, 44]])
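The matrix above can be summarised numerically. The sketch below copies its values into an array and derives the cross-validated accuracy together with per-class recall and precision; it relies on scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

# Values copied from the confusion matrix printed above
cm = np.array([[47,  0,  0],
               [ 0, 38,  4],
               [ 0,  2, 44]])

# Correct predictions lie on the diagonal
accuracy = np.trace(cm) / cm.sum()

# Recall: fraction of each true class that was predicted correctly (row-wise)
recall = np.diag(cm) / cm.sum(axis=1)

# Precision: fraction of each predicted class that was correct (column-wise)
precision = np.diag(cm) / cm.sum(axis=0)

print(round(accuracy, 4))    # 0.9556
print(np.round(recall, 4))
print(np.round(precision, 4))
```

All 47 setosa samples are classified correctly; the six errors are confusions between versicolor and virginica.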

In [64]:
# accuracy of clf on the same training data it was fitted on (an optimistic estimate)
clf.score(X_std_train, y_train)

0.9703703703703703
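The score above is measured on the data the model was trained on, so it tends to overstate how well the model generalises. A common follow-up, sketched here under assumptions, is to wrap the scaler and the classifier in a `Pipeline` (so the scaler is refit inside each cross-validation fold) and then score once on the held-out test set. The sketch uses `load_iris` as a stand-in for the CSV file, and the variable names are illustrative; the printed numbers are not claimed to match the outputs in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Built-in copy of the iris data, used here in place of the CSV file
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.9, random_state=0)

# The pipeline scales and classifies in one estimator
model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))

# 3-fold cross-validation on the training set; the scaler is refit per fold
cv_scores = cross_val_score(model, X_tr, y_tr, cv=3)

# Fit on the full training set, then evaluate once on the untouched test set
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)

print(cv_scores.mean())
print(test_accuracy)
```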