# Understanding Classification in ML

**Supervised machine learning** techniques involve training a model to operate on a set of _features_ and predict a _label_, using a dataset that includes some already known label values.

You can think of the model as a function, in which **_y_** represents the label we want to predict and **_X_** represents the vector of features the model uses to predict it.

$$y = f([x_1, x_2, x_3, ...])$$


**Classification** is a form of supervised machine learning in which you train a model to use the features (the _x_ values in our function) to predict a label (_y_). The model calculates the probability of the observed case belonging to each of a number of possible classes, and then predicts an appropriate label.
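
As a rough illustration of that last step, the sketch below takes a hypothetical vector of per-class probabilities and picks the most likely class with NumPy's `argmax`. The class names and probability values are made up for illustration; they are not the output of any model in this notebook.

```python
import numpy as np

# Hypothetical class names and predicted probabilities for one observed case.
# These values are invented for illustration; a real model would produce them.
classes = np.array(["A", "B", "C"])
probabilities = np.array([0.2, 0.7, 0.1])

# The predicted label is the class with the highest probability.
predicted = classes[np.argmax(probabilities)]
print(predicted)  # B
```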

The simplest form of classification is _binary classification_, in which the label is 0 or 1, representing one of two classes; for example, "True" or "False", "Internal" or "External", and so on.
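
In the binary case, a predicted probability for class 1 is typically turned into a 0 or 1 label by comparing it with a threshold. The sketch below assumes a threshold of 0.5, which is a common default rather than a requirement, and uses made-up probabilities.

```python
import numpy as np

# Hypothetical predicted probabilities of class 1 for four cases (made-up values).
p_class_1 = np.array([0.05, 0.48, 0.51, 0.97])

# Assumed decision threshold; 0.5 is a common default, not a fixed rule.
threshold = 0.5

# Cases at or above the threshold are labeled 1, the rest 0.
labels = (p_class_1 >= threshold).astype(int)
print(labels)  # [0 0 1 1]
```

Raising or lowering `threshold` trades one kind of misclassification for the other, so the right value depends on the problem.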

# Binary Classification



Load the dataset. This data consists of diagnostic information about some patients who have been tested for diabetes. Note that the final column in the dataset (**Diabetic**) contains the value **_0_** for patients who tested negative for diabetes, and **_1_** for patients who tested positive. This is the label that we will train our model to predict; most of the other columns (**Pregnancies**, **PlasmaGlucose**, **DiastolicBloodPressure**, and so on) are the features we will use to predict the **Diabetic** label.

In [1]:
import pandas as pd

# load the diabetes dataset for training
diabetes = pd.read_csv('./../../data/diabetes.csv')
diabetes.head()

| | PatientID | Pregnancies | PlasmaGlucose | DiastolicBloodPressure | TricepsThickness | SerumInsulin | BMI | DiabetesPedigree | Age | Diabetic |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1354778 | 0 | 171 | 80 | 34 | 23 | 43.509726 | 1.213191 | 21 | 0 |
| 1 | 1147438 | 8 | 92 | 93 | 47 | 36 | 21.240576 | 0.158365 | 23 | 0 |
| 2 | 1640031 | 7 | 115 | 47 | 52 | 35 | 41.511523 | 0.079019 | 23 | 0 |
| 3 | 1883350 | 9 | 103 | 78 | 25 | 304 | 29.582192 | 1.28287 | 43 | 1 |
| 4 | 1424119 | 1 | 85 | 59 | 27 | 35 | 42.604536 | 0.549542 | 22 | 0 |


Separate the features **_X_** and the label **_y_**:

In [10]:
features = diabetes.columns.drop(['PatientID', 'Diabetic'])
label = 'Diabetic'

# .values returns the underlying NumPy arrays for the features and the label
X, y = diabetes[features].values, diabetes[label].values

print(X, y)

[[0.00000000e+00 1.71000000e+02 8.00000000e+01 ... 4.35097259e+01
  1.21319135e+00 2.10000000e+01]
 [8.00000000e+00 9.20000000e+01 9.30000000e+01 ... 2.12405757e+01
  1.58364981e-01 2.30000000e+01]
 [7.00000000e+00 1.15000000e+02 4.70000000e+01 ... 4.15115235e+01
  7.90185680e-02 2.30000000e+01]
 ...
 [0.00000000e+00 9.30000000e+01 8.90000000e+01 ... 1.86906831e+01
  4.27048955e-01 2.40000000e+01]
 [0.00000000e+00 1.32000000e+02 9.80000000e+01 ... 1.97916451e+01
  3.02257208e-01 2.30000000e+01]
 [3.00000000e+00 1.14000000e+02 6.50000000e+01 ... 3.62154365e+01
  1.47362850e-01 3.40000000e+01]] [0 0 0 ... 0 0 1]
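
To sanity-check the separation, it can help to confirm the array shapes and look at how the two classes are distributed. The sketch below does this on a small stand-in DataFrame built from the five rows shown in the `head()` output, keeping only three of the feature columns for brevity, so it runs without the CSV file. The name `sample` is hypothetical, and the class counts it prints describe only these five rows, not the full dataset.

```python
import pandas as pd

# Stand-in for the diabetes data: the five rows shown by diabetes.head(),
# reduced to three feature columns for brevity (an assumption of this sketch).
sample = pd.DataFrame({
    "PatientID": [1354778, 1147438, 1640031, 1883350, 1424119],
    "Pregnancies": [0, 8, 7, 9, 1],
    "PlasmaGlucose": [171, 92, 115, 103, 85],
    "Age": [21, 23, 23, 43, 22],
    "Diabetic": [0, 0, 0, 1, 0],
})

# Same separation as in the notebook: drop the ID and the label from the features.
features = sample.columns.drop(["PatientID", "Diabetic"])
label = "Diabetic"
X, y = sample[features].values, sample[label].values

# X should be 2-D (rows x features) and y 1-D with one entry per row.
print(X.shape, y.shape)  # (5, 3) (5,)

# value_counts() (called with parentheses) shows how many cases fall in each class.
print(sample[label].value_counts())
```

Running the same two checks on the full `diabetes` DataFrame would show whether the classes are balanced before any model is trained.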
