**Links**:
* https://www.kaggle.com/warther/svm-logreg
* https://www.datacamp.com/community/tutorials/decision-tree-classification-python

## Importing required libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split # Import train_test_split function

In [2]:
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
# load dataset
pima = pd.read_csv("diabetes.csv", header=0, names=col_names)
# independent copy of the same data; the SVM predictions are attached to it later
submission = pima.copy()

In [3]:
pima.head()

Unnamed: 0,pregnant,glucose,bp,skin,insulin,bmi,pedigree,age,label
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [4]:
pima.info()  # info() prints its summary itself and returns None, so no print() is needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
pregnant    768 non-null int64
glucose     768 non-null int64
bp          768 non-null int64
skin        768 non-null int64
insulin     768 non-null int64
bmi         768 non-null float64
pedigree    768 non-null float64
age         768 non-null int64
label       768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


## Splitting Data
To understand model performance, dividing the dataset into a training set and a test set is a good strategy.

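Let's split the dataset using the train_test_split() function. You need to pass three arguments: the features, the target, and the test set size. Setting random_state as well makes the shuffle, and therefore the split, reproducible.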

In [5]:
##### Split between X and y #####

X=pima.drop('label',axis=1)
y=pima['label']

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test

print('train size is %i'%y_train.shape[0])
print('test size is %i'%y_test.shape[0])

train size is 537
test size is 231
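
A plain random split does not guarantee that both parts contain the same share of positive cases. The sketch below shows how the stratify argument of train_test_split() can keep the class ratio equal in both subsets. It runs on synthetic stand-in data: the arrays X_demo and y_demo and the class counts are assumptions for illustration, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 768 rows, 8 features, about one third positive.
rng = np.random.RandomState(1)
X_demo = rng.rand(768, 8)
y_demo = np.array([1] * 268 + [0] * 500)

# stratify=y_demo keeps the class ratio (almost) identical in both subsets.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

print('train size is %i' % yd_train.shape[0])
print('test size is %i' % yd_test.shape[0])
print('positive rate: train %.3f, test %.3f' % (yd_train.mean(), yd_test.mean()))
```

In the notebook itself the same idea would amount to adding stratify=y to the existing call.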


## Building SVM Model

SVM and Logistic Regression will be tested in a pipeline with preprocessing steps: an imputer, a polynomial transformation, and data scaling.

We will put the two algorithms into a loop.

The results will be analysed with a ROC curve, and a learning curve will be used to look for ways to improve the score. The cells that follow fit only the SVM, directly on the raw features.
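
The pipeline described above could look roughly like the following sketch. It is not the notebook's code: the data is synthetic (make_classification), and the imputation strategy, the polynomial degree and the step names are assumptions chosen for illustration. The learning curve is left out, and the ROC analysis is reduced to the area under the curve.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 768 rows, 8 features, binary target.
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=5, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# The two algorithms that are looped over.
models = {
    'SVM': SVC(kernel='rbf', random_state=0),
    'LogReg': LogisticRegression(max_iter=1000),
}

auc_scores = {}
for name, model in models.items():
    pipe = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),               # fill missing values
        ('poly', PolynomialFeatures(degree=2, include_bias=False)),  # polynomial transformation
        ('scaler', StandardScaler()),                                # data scaling
        ('clf', model),
    ])
    pipe.fit(Xd_train, yd_train)
    # A ROC curve needs continuous scores, not hard class labels.
    scores = pipe.decision_function(Xd_test)
    auc_scores[name] = roc_auc_score(yd_test, scores)
    print('%s ROC AUC: %.3f' % (name, auc_scores[name]))
```

Putting the scaler inside the pipeline means it is fitted on the training rows only, so no information from the test set leaks into preprocessing.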

In [6]:
from sklearn.svm import SVC
from sklearn import metrics
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

In [7]:
classifier = SVC(random_state=0, kernel='rbf',decision_function_shape='ovo')

classifier.fit(X_train,y_train)

#Predict the response for test dataset
y_pred = classifier.predict(X_test)



## Evaluating Model
Let's estimate how accurately the classifier can predict whether a patient has diabetes.

Accuracy can be computed by comparing actual test set values and predicted values.

In [8]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

report = metrics.classification_report(y_test, y_pred)

# Only the SVM was fitted above, so the report is labelled with that name alone
clf_name = 'SVM'

print('Reporting for %s:' % clf_name)

print(report)

Accuracy: 0.6320346320346321
Reporting for SVM:
              precision    recall  f1-score   support

           0       0.63      1.00      0.77       146
           1       0.00      0.00      0.00        85

    accuracy                           0.63       231
   macro avg       0.32      0.50      0.39       231
weighted avg       0.40      0.63      0.49       231
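
The report can be cross-checked by hand. A recall of 1.00 for class 0 and 0.00 for class 1 means every test sample was predicted as class 0, so the accuracy is simply the share of class 0 in the test set (146 of 231). The sketch below rebuilds label vectors from the support column to confirm this. The vectors y_true_demo and y_pred_demo are reconstructions for illustration, not the notebook's y_test and y_pred, and the original row order is not recoverable.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Rebuilt from the support column: 146 samples of class 0, 85 samples of class 1.
y_true_demo = np.array([0] * 146 + [1] * 85)
# Recall 1.00 for class 0 and 0.00 for class 1: every prediction is 0.
y_pred_demo = np.zeros_like(y_true_demo)

cm = confusion_matrix(y_true_demo, y_pred_demo)  # rows: actual, columns: predicted
acc = accuracy_score(y_true_demo, y_pred_demo)

print(cm)
print('Accuracy: %.4f' % acc)
```

In other words, the fitted model behaves like a constant majority-class predictor on the test set. A common cause with an RBF kernel is unscaled features, although that is not verified here.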





## Create a copy of the original data so the SVM labels can later be used with a Decision Tree

In [9]:
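# Feature columns only, the same columns the classifier was trained on
original = pima.drop('label', axis=1)

# Predict the response for the whole dataset to compare results
# (this includes the training rows, so agreement with the true labels is optimistic)
pima_pred = classifier.predict(original)
print(pima_pred)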

[0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1 1 0 0 0 0 1 0 0 0 0 0
 1 1 1 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0
 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1
 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 1 0 1 0 0 0 0 0
 0 1 0 1 0 0 0 1 1 0 0 0 1 1 1 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 1 0 0 0 1 1 1
 0 0 0 0 0 1 0 0 1 1 0 0 0 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
 1 0 1 0 0 1 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 1 0 0 0 1 1 1 0 0
 1 0 0 0 1 0 0 1 0 0 1 0 1 1 0 0 1 0 1 0 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 0 1
 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 0 0 0 0 1
 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
 0 0 1 0 0 0 0 1 1 0 1 0 1 0 0 0 0 1 1 0 1 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1
 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0
 0 0 0 1 1 0 0 0 0 0 0 0 

In [10]:
submission["svm_label"] = pima_pred

del submission['label']

submission.head()

Unnamed: 0,pregnant,glucose,bp,skin,insulin,bmi,pedigree,age,label,svm_label
0,6,148,72,35,0,33.6,0.627,50,1,0
1,1,85,66,29,0,26.6,0.351,31,0,0
2,8,183,64,0,0,23.3,0.672,32,1,1
3,1,89,66,23,94,28.1,0.167,21,0,0
4,0,137,40,35,168,43.1,2.288,33,1,1
...,...,...,...,...,...,...,...,...,...,...
695,7,142,90,24,480,30.4,0.128,43,1,1
696,3,169,74,19,125,29.9,0.268,31,1,1
697,0,99,0,0,0,25.0,0.253,22,0,0
698,4,127,88,11,155,34.5,0.598,28,0,0


In [11]:
submission.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 10 columns):
pregnant     768 non-null int64
glucose      768 non-null int64
bp           768 non-null int64
skin         768 non-null int64
insulin      768 non-null int64
bmi          768 non-null float64
pedigree     768 non-null float64
age          768 non-null int64
label        768 non-null int64
svm_label    768 non-null int64
dtypes: float64(2), int64(8)
memory usage: 60.1 KB


### Export a new CSV file for this run

In [12]:
submission.to_csv("diabetes_svm.csv", index=False)