
# Binary Classification Problem

Solved using:
1. Logistic Regression
2. Support Vector Machine
3. Naive Bayes
4. K-NN

## Data source: https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# note: magic04.data has no header row, so this call uses the first record as column names
data = pd.read_csv("magic04.data")

In [3]:
data.head()

Unnamed: 0,28.7967,16.0021,2.6449,0.3918,0.1982,27.7004,22.011,-8.2027,40.092,81.8828,g
0,31.6036,11.7235,2.5185,0.5303,0.3773,26.2722,23.8238,-9.9574,6.3609,205.261,g
1,162.052,136.031,4.0612,0.0374,0.0187,116.741,-64.858,-45.216,76.96,256.788,g
2,23.8172,9.5728,2.3385,0.6147,0.3922,27.2107,-6.4633,-7.1513,10.449,116.737,g
3,75.1362,30.9205,3.1611,0.3168,0.1832,-5.5277,28.5525,21.8393,4.648,356.462,g
4,51.624,21.1502,2.9085,0.242,0.134,50.8761,43.1887,9.8145,3.613,238.098,g


In [4]:
# assign column names
cols = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

data.columns = cols

In [5]:
data.head()

Unnamed: 0,fLength,fWidth,fSize,fConc,fConc1,fAsym,fM3Long,fM3Trans,fAlpha,fDist,class
0,31.6036,11.7235,2.5185,0.5303,0.3773,26.2722,23.8238,-9.9574,6.3609,205.261,g
1,162.052,136.031,4.0612,0.0374,0.0187,116.741,-64.858,-45.216,76.96,256.788,g
2,23.8172,9.5728,2.3385,0.6147,0.3922,27.2107,-6.4633,-7.1513,10.449,116.737,g
3,75.1362,30.9205,3.1611,0.3168,0.1832,-5.5277,28.5525,21.8393,4.648,356.462,g
4,51.624,21.1502,2.9085,0.242,0.134,50.8761,43.1887,9.8145,3.613,238.098,g
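
Because `magic04.data` has no header row, the default `pd.read_csv` call used the first record as column names (it appears as the header of the first `data.head()` output), so that record is missing from `data`. A minimal sketch of one way to keep it is to pass the names at read time. The two-row in-memory sample below is an assumption that stands in for the real file:

```python
import io
import pandas as pd

cols = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym",
        "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

# Two records copied from the outputs above, standing in for magic04.data
sample = (
    "28.7967,16.0021,2.6449,0.3918,0.1982,27.7004,22.011,-8.2027,40.092,81.8828,g\n"
    "31.6036,11.7235,2.5185,0.5303,0.3773,26.2722,23.8238,-9.9574,6.3609,205.261,g\n"
)

# Default behaviour: the first record is consumed as the header
lost_row = pd.read_csv(io.StringIO(sample))

# names=cols tells pandas the file has no header row, so every record is kept
kept_all = pd.read_csv(io.StringIO(sample), names=cols)

print(len(lost_row), len(kept_all))
```

With the real file, the same idea would read `pd.read_csv("magic04.data", names=cols)`, which also makes the later `data.columns = cols` assignment unnecessary.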


In [6]:
# checking for missing values and data types
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19019 entries, 0 to 19018
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   fLength   19019 non-null  float64
 1   fWidth    19019 non-null  float64
 2   fSize     19019 non-null  float64
 3   fConc     19019 non-null  float64
 4   fConc1    19019 non-null  float64
 5   fAsym     19019 non-null  float64
 6   fM3Long   19019 non-null  float64
 7   fM3Trans  19019 non-null  float64
 8   fAlpha    19019 non-null  float64
 9   fDist     19019 non-null  float64
 10  class     19019 non-null  object 
dtypes: float64(10), object(1)
memory usage: 1.6+ MB


In [7]:
data['class'].unique()

array(['g', 'h'], dtype=object)

In [8]:
# convert the entries of the class column to 1 if the entry is 'g' and 0 otherwise
data['class'] = (data['class'] == 'g').astype(int)

In [9]:
data.head()

Unnamed: 0,fLength,fWidth,fSize,fConc,fConc1,fAsym,fM3Long,fM3Trans,fAlpha,fDist,class
0,31.6036,11.7235,2.5185,0.5303,0.3773,26.2722,23.8238,-9.9574,6.3609,205.261,1
1,162.052,136.031,4.0612,0.0374,0.0187,116.741,-64.858,-45.216,76.96,256.788,1
2,23.8172,9.5728,2.3385,0.6147,0.3922,27.2107,-6.4633,-7.1513,10.449,116.737,1
3,75.1362,30.9205,3.1611,0.3168,0.1832,-5.5277,28.5525,21.8393,4.648,356.462,1
4,51.624,21.1502,2.9085,0.242,0.134,50.8761,43.1887,9.8145,3.613,238.098,1
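
The comparison used above sends every label other than `g` to 0, including any unexpected value. A small sketch of a stricter alternative uses `Series.map`, which leaves unknown labels as `NaN` so they can be detected. The toy series and its `"?"` entry are made up for illustration:

```python
import pandas as pd

# Toy labels; "?" is a hypothetical bad entry, not something seen in the dataset
labels = pd.Series(["g", "h", "g", "?"])

# Labels missing from the mapping become NaN instead of silently turning into 0
encoded = labels.map({"g": 1, "h": 0})
n_unknown = int(encoded.isna().sum())

print(n_unknown)
```

Here the dataset only contains `g` and `h`, as `data['class'].unique()` shows, so both approaches give the same result.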


**Split data into training (60%), validation (20%), and test (20%) sets**

In [10]:
# shuffle all rows, then cut at the 60% and 80% marks
train, valid, test = np.split(data.sample(frac = 1), [int(0.6*len(data)), int(0.8*len(data))])

In [11]:
print(train.shape)
print(valid.shape)
print(test.shape)

(11411, 11)
(3804, 11)
(3804, 11)
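
The shuffle-and-cut split above is not seeded and does not control the class ratio in each part. As a hedged alternative, scikit-learn's `train_test_split` can produce the same 60/20/20 proportions with a fixed seed and stratification on the label. The toy frame and the names `train_df`, `valid_df`, and `test_df` are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for `data`: 1000 rows, roughly 65% positive class
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "class": (rng.random(1000) < 0.65).astype(int),
})

# Carve off 60% for training, then split the remainder in half
train_df, rest_df = train_test_split(
    toy, train_size=0.6, stratify=toy["class"], random_state=42)
valid_df, test_df = train_test_split(
    rest_df, test_size=0.5, stratify=rest_df["class"], random_state=42)

print(train_df.shape, valid_df.shape, test_df.shape)
```

Stratifying keeps the proportion of each class nearly identical across the three parts, which makes validation and test scores easier to compare.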


**Feature scaling**

In [12]:
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler

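def feature_scale(dataframe, oversample=False):
  # features are every column except the last; the last column is the label
  x = dataframe[dataframe.columns[:-1]].values
  y = dataframe[dataframe.columns[-1]].values

  # standardize each feature to zero mean and unit variance
  # (the scaler is fitted on whichever split is passed in)
  scaler = StandardScaler()
  X = scaler.fit_transform(x)

  # optionally resample minority-class rows until both classes are balanced
  if oversample:
    ros = RandomOverSampler()
    X, y = ros.fit_resample(X, y)

  # recombine scaled features and labels into one 2-D array
  data = np.hstack((X, np.reshape(y, (-1, 1))))

  return data, X, y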



In [13]:
# check class balance before oversampling
print(len(train[train['class'] == 1]))
print(len(train[train['class'] == 0]))

7413
3998
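
The counts above show 7413 rows of class 1 against 3998 rows of class 0 in this training split. Random oversampling evens this out by drawing minority-class rows with replacement until both classes have the same count. The following numpy-only sketch illustrates the idea; it is not imblearn's actual implementation, and `random_oversample` is a hypothetical helper:

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until classes are balanced."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]  # start with every original row
    for cls, count in zip(classes, counts):
        if count < target:
            idx = np.flatnonzero(y == cls)
            # draw the missing rows from this class, with replacement
            keep.append(rng.choice(idx, size=target - count, replace=True))
    order = np.concatenate(keep)
    return X[order], y[order]

# Toy data: four rows of class 1, two rows of class 0
X = np.arange(12, dtype=float).reshape(6, 2)
y = np.array([1, 1, 1, 1, 0, 0])

X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y_bal))  # prints [4 4]
```

Because the added rows are exact copies, oversampling is normally applied to the training set only, so that duplicated rows never appear in the validation or test sets.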


In [14]:
train, X_train, y_train = feature_scale(train, oversample = True)
valid, X_valid, y_valid = feature_scale(valid, oversample = False)
test, X_test, y_test = feature_scale(test, oversample = False)
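
As written, `feature_scale` fits a fresh `StandardScaler` on each split, so the validation and test sets are standardized with their own means and variances rather than those of the training set. A common alternative, sketched here on toy arrays that stand in for the real splits, fits the scaler on the training features only and reuses it for the other splits:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices standing in for the real training and test splits
rng = np.random.default_rng(0)
X_train_raw = rng.normal(loc=50.0, scale=10.0, size=(600, 3))
X_test_raw = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_raw)  # learn mean/std from train
X_test_scaled = scaler.transform(X_test_raw)        # reuse the same mean/std

print(X_train_scaled.mean(axis=0).round(3))
print(X_test_scaled.mean(axis=0).round(3))
```

With this approach the training features have exactly zero mean after scaling, while the test features are only approximately centered, which reflects how the model would see new data.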

**Implementation of the K-NN algorithm**