<a href="https://colab.research.google.com/github/adeyemi-dev/notezipper/blob/master/Logistic_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [14]:
# importing libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


Logistic Regression

In [2]:
class Logistic_Regression():


  # store the hyperparameters: learning rate and number of iterations
  def __init__(self, learning_rate, no_of_iteration):

    self.learning_rate = learning_rate
    self.no_of_iteration = no_of_iteration


  # fit function: trains the model on the dataset
  def fit(self, X, Y):

    # number of data points in the dataset (number of rows) ==> m
    # number of input features in the dataset (number of columns) ==> n
    self.m, self.n = X.shape

    # initialize the weights and bias to zero
    self.w = np.zeros(self.n)
    self.b = 0

    self.X = X
    self.Y = Y

    # run gradient descent for the requested number of iterations
    for i in range(self.no_of_iteration):
      self.update_weight()


  def update_weight(self):

    # predicted probability: Y_hat = sigmoid(z), where z = X.w + b
    Y_hat = 1 / (1 + np.exp( - (self.X.dot(self.w) + self.b) ))

    # gradients of the mean cross-entropy loss
    # X is [m x n] and (Y_hat - Y) has length m, so X is transposed to
    # [n x m] to make the shapes line up; dw then has length n
    dw = (1/self.m)*np.dot(self.X.T, (Y_hat - self.Y))

    db = (1/self.m)*np.sum(Y_hat - self.Y)

    # update the weights and bias using the gradient descent rule
    self.w = self.w - self.learning_rate * dw

    self.b = self.b - self.learning_rate * db


  # sigmoid equation and decision boundary
  def predict(self, X):

    Y_pred = 1 / (1 + np.exp( - (X.dot(self.w) + self.b) ))
    # probabilities above 0.5 are labelled 1, everything else 0
    Y_pred = np.where(Y_pred > 0.5, 1, 0)
    return Y_pred
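
The gradient formulas used in update_weight can be sanity-checked numerically. The sketch below is not part of the original notebook: it compares the analytic dw and db against central finite differences of the mean cross-entropy loss on small synthetic data. The helper names (sigmoid, mean_log_loss, analytic_gradients) and the synthetic arrays are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_log_loss(w, b, X, y):
    # mean binary cross-entropy for weights w and bias b
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_gradients(w, b, X, y):
    # same formulas as dw and db in update_weight
    m = X.shape[0]
    error = sigmoid(X @ w + b) - y
    return (X.T @ error) / m, np.sum(error) / m

# synthetic data, for illustration only (not the diabetes dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = (rng.random(50) > 0.5).astype(float)
w_demo = rng.normal(size=3)
b_demo = 0.1

dw, db = analytic_gradients(w_demo, b_demo, X_demo, y_demo)

# central finite differences of the loss, one coordinate at a time
h = 1e-6
num_dw = np.zeros_like(w_demo)
for j in range(w_demo.size):
    step = np.zeros_like(w_demo)
    step[j] = h
    num_dw[j] = (mean_log_loss(w_demo + step, b_demo, X_demo, y_demo)
                 - mean_log_loss(w_demo - step, b_demo, X_demo, y_demo)) / (2 * h)
num_db = (mean_log_loss(w_demo, b_demo + h, X_demo, y_demo)
          - mean_log_loss(w_demo, b_demo - h, X_demo, y_demo)) / (2 * h)

max_diff = max(np.max(np.abs(dw - num_dw)), abs(db - num_db))
print("largest difference between analytic and numerical gradients:", max_diff)
```

A very small difference suggests the derivative expressions match the loss being minimized.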






Data collection and Processing


In [3]:
diabetes_data = pd.read_csv('/content/diabetes.csv')
 

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# print first 5 rows of the dataset
diabetes_data.head()

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1


In [5]:
# print the last 5 rows in the dataset
diabetes_data.tail()

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
763           10      101             76             48      180  32.9                     0.171   63        0
764            2      122             70             27        0  36.8                     0.340   27        0
765            5      121             72             23      112  26.2                     0.245   30        0
766            1      126             60              0        0  30.1                     0.349   47        1
767            1       93             70             31        0  30.4                     0.315   23        0


In [6]:
# number of rows and columns
diabetes_data.shape

(768, 9)

In [7]:
# info about the data 
diabetes_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [8]:
# checking if our data has some missing values 
diabetes_data.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [9]:
# summary statistics for the data
diabetes_data.describe()

       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin         BMI  DiabetesPedigreeFunction         Age     Outcome
count   768.000000  768.000000     768.000000     768.000000  768.000000  768.000000                768.000000  768.000000  768.000000
mean      3.845052  120.894531      69.105469      20.536458   79.799479   31.992578                  0.471876   33.240885    0.348958
std       3.369578   31.972618      19.355807      15.952218  115.244002    7.884160                  0.331329   11.760232    0.476951
min       0.000000    0.000000       0.000000       0.000000    0.000000    0.000000                  0.078000   21.000000    0.000000
25%       1.000000   99.000000      62.000000       0.000000    0.000000   27.300000                  0.243750   24.000000    0.000000
50%       3.000000  117.000000      72.000000      23.000000   30.500000   32.000000                  0.372500   29.000000    0.000000
75%       6.000000  140.250000      80.000000      32.000000  127.250000   36.600000                  0.626250   41.000000    1.000000
max      17.000000  199.000000     122.000000      99.000000  846.000000   67.100000                  2.420000   81.000000    1.000000
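
The summary reports a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin and BMI, even though isnull() found no missing values. One possible reading, which is an assumption and not something the notebook confirms, is that zeros stand in for unrecorded measurements. A minimal sketch of counting them, using only the five rows shown by head() (the sample frame is illustrative, not the full dataset):

```python
import pandas as pd

# the five rows printed by head(), restricted to the measurement columns
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# count how many entries in each column are exactly zero
zero_counts = (sample == 0).sum()
print(zero_counts)
```

Running the same expression on diabetes_data would show how many zeros each full column contains.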


In [11]:
# checking the distribution of the target variable
diabetes_data['Outcome'].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

1 --> the person is diabetic
0 --> the person is not diabetic
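
The two classes are not balanced: 500 rows have Outcome 0 and 268 have Outcome 1. A short sketch, using only those two counts, shows each class share and the accuracy a model would get by always predicting the majority class. The variable names are illustrative.

```python
import pandas as pd

# class counts copied from the value_counts() output above
outcome_counts = pd.Series({0: 500, 1: 268}, name="Outcome")

# share of each class in the dataset
class_share = outcome_counts / outcome_counts.sum()

# accuracy of a trivial model that always predicts the larger class
majority_baseline = class_share.max()

print(class_share.round(3))
print("accuracy of always predicting the majority class:", round(majority_baseline, 3))
```

This baseline is a useful reference point when reading the accuracy scores later in the notebook.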


Splitting the Features and Target

In [13]:
X = diabetes_data.drop(columns='Outcome')
Y = diabetes_data['Outcome']

In [13]:
print(X)

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0              6      148             72             35        0  33.6
1              1       85             66             29        0  26.6
2              8      183             64              0        0  23.3
3              1       89             66             23       94  28.1
4              0      137             40             35      168  43.1
..           ...      ...            ...            ...      ...   ...
763           10      101             76             48      180  32.9
764            2      122             70             27        0  36.8
765            5      121             72             23      112  26.2
766            1      126             60              0        0  30.1
767            1       93             70             31        0  30.4

     DiabetesPedigreeFunction  Age
0                       0.627   50
1                       0.351   31
2                       0.672   32
3                       0.167   21
4                       2.288   33
..                        ...  ...
763                     0.171   63
764                     0.340   27
765                     0.245   30
766                     0.349   47
767                     0.315   23

[768 rows x 8 columns]

In [14]:
print(Y)

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64


Standardization of Data


In [15]:
scaler = StandardScaler()

In [17]:
scaler.fit(X)

StandardScaler()

In [18]:
standardized_data = scaler.transform(X)

In [19]:
print(standardized_data)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]


In [20]:
X = standardized_data
Y = diabetes_data['Outcome']


In [21]:
print(X)
print(Y)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]
0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64
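
In the cells above, the scaler is fitted on the full dataset before any split. A common alternative, sketched here on synthetic data, is to fit the scaler on the training rows only and reuse it for the test rows, so that no statistics from the test rows feed into preprocessing. This is not the approach the notebook takes, and the array names below are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the feature matrix and labels (illustration only)
rng = np.random.default_rng(2)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_raw = rng.integers(0, 2, size=200)

# split first, then fit the scaler on the training rows only
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, stratify=y_raw, random_state=2
)
train_scaler = StandardScaler().fit(X_tr)

# apply the training-set mean and standard deviation to both splits
X_tr_std = train_scaler.transform(X_tr)
X_te_std = train_scaler.transform(X_te)

print(X_tr_std.mean(axis=0).round(6), X_tr_std.std(axis=0).round(6))
```

The training columns come out with mean 0 and standard deviation 1, while the test columns are only approximately standardized, which is expected.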


Splitting our Data into Training data and Test data 

In [22]:
X_train, X_test , Y_train , Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y , random_state=2)

In [23]:
print(X.shape, X_train.shape,X_test.shape)

(768, 8) (614, 8) (154, 8)
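
The stratify=Y argument asks train_test_split to keep the class ratio roughly the same in both splits. The sketch below rebuilds a label vector with the same class counts as the dataset (500 zeros, 268 ones) and checks the share of positives in each split. The dummy feature column is an assumption used only to make the call runnable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# labels with the same class counts as the Outcome column
y_all = np.array([0] * 500 + [1] * 268)
# placeholder features: only the labels matter for this check
X_dummy = np.arange(len(y_all)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_dummy, y_all, test_size=0.2, stratify=y_all, random_state=2
)

print("share of positives overall:", round(y_all.mean(), 3))
print("share of positives in train:", round(y_tr.mean(), 3))
print("share of positives in test:", round(y_te.mean(), 3))
```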


Model Training

Logistic Regression Model

In [24]:
model = Logistic_Regression(learning_rate=0.01, no_of_iteration=1000)

Training the Logistic Regression model on the training data so it can learn patterns in the features

In [25]:
model.fit(X_train,Y_train)
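
One way to judge whether a learning rate of 0.01 and 1000 iterations are enough is to record the loss at every step. The class above does not store the loss, so the sketch below repeats the same update rule in a plain loop on synthetic data and keeps a list of loss values. The data and variable names are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# synthetic, roughly linearly separable data (illustration only)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(float)

w = np.zeros(2)
b = 0.0
learning_rate, no_of_iteration = 0.01, 1000

losses = []
for _ in range(no_of_iteration):
    p = sigmoid(X_demo @ w + b)
    # mean binary cross-entropy before this update
    losses.append(-np.mean(y_demo * np.log(p) + (1 - y_demo) * np.log(1 - p)))
    error = p - y_demo
    # same gradient descent updates as update_weight
    w -= learning_rate * (X_demo.T @ error) / len(y_demo)
    b -= learning_rate * error.mean()

print("first loss:", round(losses[0], 4), "last loss:", round(losses[-1], 4))
```

If the loss is still dropping quickly at the last iteration, more iterations or a larger learning rate may be worth trying.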

Model Evaluation 

Accuracy score

In [26]:
# accuracy on training data
# compare the predicted labels with the true labels
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [28]:
print("Accuracy on Training data : ", training_data_accuracy)

Accuracy on Training data :  0.7785016286644951


In [29]:
# accuracy on test data
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [30]:
print("Accuracy on Test data : ", test_data_accuracy)

Accuracy on Test data :  0.7597402597402597


The two accuracy scores are close to each other (about 0.78 on the training data and 0.76 on the test data), which suggests the model did not overfit.
Overfitting shows up as a very high accuracy on the training data paired with a much lower accuracy on the test data.
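
To make the comparison concrete, the reported accuracies can be turned into a gap and into counts of correctly classified rows. The sketch uses only the two scores and the split sizes printed above; the variable names for the derived values are illustrative.

```python
# scores and split sizes copied from the outputs above
training_data_accuracy = 0.7785016286644951
test_data_accuracy = 0.7597402597402597
n_train, n_test = 614, 154

# difference between training and test accuracy
accuracy_gap = training_data_accuracy - test_data_accuracy

# accuracy times the number of rows gives the number of correct predictions
train_correct = round(training_data_accuracy * n_train)
test_correct = round(test_data_accuracy * n_test)

print(f"gap: {accuracy_gap:.4f}")
print(f"correct on training data: {train_correct} of {n_train}")
print(f"correct on test data: {test_correct} of {n_test}")
```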

Build our Predictive System

In [31]:
input_data = (5,166,72,19,175,25.8,0.587,51)

# change the input data to a numpy array before making a prediction on it
input_data_as_numpy_array = np.asarray(input_data)

# reshape the numpy array as we are predicting for only one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1,-1)

# standardize the input with the scaler fitted earlier,
# because the model was trained on standardized data
std_data = scaler.transform(input_data_reshaped)

prediction = model.predict(std_data)

print(prediction)

if (prediction[0] == 0):
  print('The person is not diabetic')
else:
  print('The person is diabetic')



[1]
The person is diabetic
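
Because the model is trained on standardized features, any new record needs the same scaling before it is passed to predict. The sketch below wraps both steps in a small helper. The function name predict_one is hypothetical, and the demo uses scikit-learn's LogisticRegression on synthetic data as a stand-in for the class defined above, purely so the example runs on its own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def predict_one(model, scaler, raw_features):
    # hypothetical helper: scale one raw record, then return its predicted label
    row = np.asarray(raw_features, dtype=float).reshape(1, -1)
    return int(model.predict(scaler.transform(row))[0])

# synthetic data: the label depends only on whether the first feature exceeds 100
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=100.0, scale=20.0, size=(300, 2))
y_demo = (X_demo[:, 0] > 100.0).astype(int)

# stand-in scaler and model, fitted on the synthetic data
demo_scaler = StandardScaler().fit(X_demo)
demo_model = LogisticRegression().fit(demo_scaler.transform(X_demo), y_demo)

high = predict_one(demo_model, demo_scaler, (160.0, 100.0))
low = predict_one(demo_model, demo_scaler, (40.0, 100.0))
print(high, low)
```

Any object with a predict method that accepts a 2D array should work in place of demo_model, including the Logistic_Regression class from this notebook.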
