**Links**:
* https://www.kaggle.com/warther/svm-logreg
* https://www.datacamp.com/community/tutorials/decision-tree-classification-python

## Importing required libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split # Import train_test_split function

In [2]:
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
# Load the dataset, replacing the file's header row with col_names
pima = pd.read_csv("diabetes_clean.csv", header=0, names=col_names)
# Independent copy that will later receive the SVM predictions
svm_output = pima.copy()

In [3]:
pima.head()

Unnamed: 0,pregnant,glucose,bp,skin,insulin,bmi,pedigree,age,label
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,1,89,66,23,94,28.1,0.167,21,0
3,0,137,40,35,168,43.1,2.288,33,1
4,3,78,50,32,88,31.0,0.248,26,1


In [4]:
print (pima.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 534 entries, 0 to 533
Data columns (total 9 columns):
pregnant    534 non-null int64
glucose     534 non-null int64
bp          534 non-null int64
skin        534 non-null int64
insulin     534 non-null int64
bmi         534 non-null float64
pedigree    534 non-null float64
age         534 non-null int64
label       534 non-null int64
dtypes: float64(2), int64(7)
memory usage: 37.7 KB
None


## Splitting Data
To estimate how well the model performs on unseen data, split the dataset into a training set and a test set.

Let's split the dataset with the function train_test_split(). You need to pass three parameters: the features, the target, and the test set size.

In [5]:
##### Split between X and y #####

X=pima.drop('label',axis=1)
y=pima['label']

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test

print('train size is %i'%y_train.shape[0])
print('test size is %i'%y_test.shape[0])

train size is 373
test size is 161
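
The sizes printed above follow from how train_test_split() rounds a fractional test_size: when train_size is left unset, scikit-learn takes the ceiling of test_size * n_samples for the test set and gives the remaining rows to the training set. A minimal sketch of that arithmetic in plain Python (the helper name `split_sizes` is hypothetical and is not part of scikit-learn):

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumes train_size is left unset, so the training set simply
    receives whatever the test set does not take.
    """
    n_test = math.ceil(test_size * n_samples)  # test set is rounded up
    n_train = n_samples - n_test               # remainder goes to training
    return n_train, n_test

n_train, n_test = split_sizes(534, 0.3)
print('train size is %i' % n_train)
print('test size is %i' % n_test)
```

With 534 rows, 30% is 160.2 rows, which is why the test set holds 161 rows rather than 160.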


## Building SVM Model

SVM and Logistic Regression will be tested in a pipeline with preprocessing steps: an imputer, a polynomial transformation and a data scaler.

We will put the two algorithms into a loop.

The results will be analysed with a ROC curve, and a learning curve will be used to look for ways to improve the score.
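
The cells that follow fit a single linear SVM directly on the raw features, so the pipeline and loop described here are not shown in this notebook. A minimal sketch of what such a loop could look like is given below. It assumes synthetic data from `make_classification`, a median imputer, degree-2 polynomial features, `StandardScaler`, and 5-fold cross-validated ROC AUC as the score. All of these choices are illustrative assumptions, not settings taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 8 diabetes features (assumption, not the real data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# The two algorithms that go through the same preprocessing steps
candidates = {
    'SVM': SVC(kernel='linear', random_state=0),
    'LogReg': LogisticRegression(max_iter=1000, random_state=0),
}

cv_scores = {}
for name, model in candidates.items():
    pipe = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),               # fill missing values
        ('poly', PolynomialFeatures(degree=2, include_bias=False)),  # polynomial transformation
        ('scaler', StandardScaler()),                                # scale the features
        ('model', model),
    ])
    # Mean ROC AUC over 5 cross-validation folds
    cv_scores[name] = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='roc_auc').mean()
    print('%s mean ROC AUC: %.3f' % (name, cv_scores[name]))
```

Keeping the preprocessing inside the pipeline means each cross-validation fold fits the imputer and scaler on its own training portion only.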

In [6]:
from sklearn.svm import SVC
from sklearn import metrics
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

In [7]:
classifier = SVC(random_state=0, kernel='linear',decision_function_shape='ovo')

classifier.fit(X_train,y_train)

#Predict the response for test dataset
y_pred = classifier.predict(X_test)
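
For intuition, a fitted linear-kernel SVM classifies a row by the sign of a weighted sum of its features plus an intercept (scikit-learn exposes these as `coef_` and `intercept_` when kernel='linear'). A plain-Python sketch of that rule follows. The weights, the intercept and the choice of three features are made up for illustration and are not the values fitted in this notebook.

```python
def linear_svm_predict(x, weights, intercept):
    """Return (label, score) for one row under a linear decision rule."""
    score = sum(w * xi for w, xi in zip(weights, x)) + intercept
    label = 1 if score > 0 else 0  # positive side of the hyperplane -> class 1
    return label, score

# Hypothetical weights for (pregnant, glucose, bp) and a hypothetical intercept
weights = [0.1, 0.03, -0.01]
intercept = -4.0

label_a, score_a = linear_svm_predict([6, 148, 72], weights, intercept)
label_b, score_b = linear_svm_predict([1, 85, 66], weights, intercept)
print('row a -> label %i' % label_a)
print('row b -> label %i' % label_b)
```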

## Evaluating Model
Let's estimate how accurately the classifier can predict the diabetes label.

Accuracy can be computed by comparing actual test set values and predicted values.

In [8]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

report=metrics.classification_report(y_test,y_pred)

clf_name='SVM'

print('Reporting for %s:'%clf_name)

print(report)

Accuracy: 0.7950310559006211
Reporting for SVM:
              precision    recall  f1-score   support

           0       0.79      0.91      0.84        98
           1       0.81      0.62      0.70        63

    accuracy                           0.80       161
   macro avg       0.80      0.76      0.77       161
weighted avg       0.80      0.80      0.79       161
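
The per-class numbers in the report can be reproduced from a handful of counts. The counts used below are not printed anywhere in the notebook. They are inferred from the reported accuracy, recalls and supports (89 of the 98 class-0 rows and 39 of the 63 class-1 rows predicted correctly), so treat them as an assumption. The helper name `class_metrics` is hypothetical.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall and F1 for one class from its raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Inferred counts (assumption): correctly predicted rows and support per class
correct_0, support_0 = 89, 98
correct_1, support_1 = 39, 63
missed_0 = support_0 - correct_0  # class-0 rows predicted as 1
missed_1 = support_1 - correct_1  # class-1 rows predicted as 0

# A row missed by one class is a false positive for the other class
metrics_0 = class_metrics(tp=correct_0, fp=missed_1, fn=missed_0)
metrics_1 = class_metrics(tp=correct_1, fp=missed_0, fn=missed_1)
accuracy = (correct_0 + correct_1) / (support_0 + support_1)

print('class 0: precision %.2f, recall %.2f, f1 %.2f' % metrics_0)
print('class 1: precision %.2f, recall %.2f, f1 %.2f' % metrics_1)
print('accuracy: %.4f' % accuracy)
```

The low recall for class 1 means the model misses a sizeable share of the positive cases even though overall accuracy is close to 0.80.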



## Confusion Matrix

In [9]:
dataset = pd.read_csv("diabetes_clean.csv", header=0, names=col_names)

y_dataset_true = dataset['label']
y_dataset_pred = classifier.predict(dataset.iloc[:,:-1])

from sklearn.metrics import confusion_matrix
confusion_matrix(y_dataset_true, y_dataset_pred)

array([[327,  30],
       [ 78,  99]], dtype=int64)
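
In the output of scikit-learn's confusion_matrix, rows are true labels and columns are predicted labels. The plain-Python sketch below reads the matrix above. Note that the matrix covers all 534 rows, including the 373 rows the model was trained on, so the figures are not a held-out estimate like the test-set report.

```python
# Rows: true label 0 / 1, columns: predicted label 0 / 1
cm = [[327, 30],
      [78, 99]]

true_counts = [sum(row) for row in cm]        # rows per true class
pred_counts = [sum(col) for col in zip(*cm)]  # rows per predicted class
total = sum(true_counts)

accuracy = (cm[0][0] + cm[1][1]) / total      # diagonal = correct predictions
recall_1 = cm[1][1] / true_counts[1]          # share of true class-1 rows found
precision_1 = cm[1][1] / pred_counts[1]       # share of predicted class-1 rows that are right

print('true counts:', true_counts)
print('predicted counts:', pred_counts)
print('accuracy: %.3f' % accuracy)
print('class 1 recall: %.3f, precision: %.3f' % (recall_1, precision_1))
```

The column sums (405 and 129) are the same counts that the later groupby on svm_label reports.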

## Create a copy of the original data, so we can later use svm_label with a Decision Tree

In [10]:
original = pima.drop('label',axis=1)
original.head()

# Predict the response for the whole dataset to compare results
pima_pred = classifier.predict(original)

In [11]:
svm_output["svm_label"] = pima_pred

In [12]:
del svm_output['label']

In [13]:
svm_output.groupby('svm_label').count()

svm_label,pregnant,glucose,bp,skin,insulin,bmi,pedigree,age
0,405,405,405,405,405,405,405,405
1,129,129,129,129,129,129,129,129


In [14]:
svm_output.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 534 entries, 0 to 533
Data columns (total 9 columns):
pregnant     534 non-null int64
glucose      534 non-null int64
bp           534 non-null int64
skin         534 non-null int64
insulin      534 non-null int64
bmi          534 non-null float64
pedigree     534 non-null float64
age          534 non-null int64
svm_label    534 non-null int64
dtypes: float64(2), int64(7)
memory usage: 37.7 KB


### Export a new dataset with the SVM output as the new label

In [15]:
svm_output.to_csv("diabetes_svm.csv", index=False)
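
Passing index=False keeps the row index out of the file. Without it, pandas writes the index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. A small sketch of the difference, using in-memory buffers and a hypothetical two-row frame instead of the real file:

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for svm_output
demo = pd.DataFrame({'glucose': [148, 85], 'svm_label': [1, 0]})

# Default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```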