# Logistic regression
Logistic regression is a popular statistical modeling technique for analyzing the relationship between a binary response variable (i.e., a variable that takes only two values, such as 0 or 1) and one or more predictor variables.

In logistic regression, a linear combination of the predictor variables is passed through the logistic function, which maps any real value to a probability between 0 and 1 for the response variable. The logistic function is an S-shaped curve, so the predicted probability is a non-linear function of the predictor variables, while the log-odds of the response remain linear in them.

Logistic regression is widely used in many fields, including healthcare, marketing, finance, and social sciences. It is particularly useful for predicting the probability of an event or outcome, such as the likelihood of a customer making a purchase or a patient developing a disease.

In Jupyter Notebook, logistic regression can be implemented using libraries such as scikit-learn, statsmodels, and TensorFlow. These libraries provide easy-to-use functions for fitting logistic regression models and making predictions based on the model.
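As a point of comparison with the from-scratch implementation later in this notebook, here is a minimal sketch of the scikit-learn route. The dataset is synthetic and made up for illustration; `LogisticRegression`, `fit`, `predict`, and `predict_proba` are part of scikit-learn, and the default settings are assumed to be adequate for this toy problem.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic one-feature dataset: the label is 1 when the feature is positive
X = np.linspace(-3, 3, 60).reshape(-1, 1)
y = (X[:, 0] > 0).astype(int)

model = LogisticRegression()
model.fit(X, y)

new_points = np.array([[-2.0], [2.0]])
labels = model.predict(new_points)               # hard 0/1 predictions
probabilities = model.predict_proba(new_points)  # columns: P(Y=0), P(Y=1)

print(labels)
print(probabilities)
```

`predict_proba` is useful when the estimated probability itself matters, not only the thresholded label.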

# The formula for logistic regression

The formula for logistic regression is as follows:

P(Y=1|X) = 1 / (1 + e^(-Xβ))

Where:

- P(Y=1|X) is the probability of the response variable Y taking the value 1 given the predictor variables X.
- X is a vector of predictor variables.
- β is a vector of coefficients that represent the effect of each predictor variable on the log-odds of the response variable.
- e is the base of the natural logarithm, approximately equal to 2.718.

The model assumes that the log-odds (logit) of the probability of Y taking the value 1 is a linear function of the predictor variables: log(P / (1 - P)) = Xβ. The sigmoid function, which is the inverse of the logit, maps these log-odds back to a probability between 0 and 1. This allows the logistic regression model to estimate the probability of the response variable taking the value 1 for a given set of predictor variables.
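As a small numerical illustration of the formula, the sketch below computes P(Y=1|X) for a single observation and checks that taking the logit of that probability recovers Xβ. The predictor values and coefficients are made up for illustration and are not fitted to any data.

```python
import numpy as np

# Hypothetical values, chosen only for illustration
x = np.array([1.0, 2.0, -0.5])      # predictor vector X
beta = np.array([0.4, -0.3, 1.2])   # coefficient vector β

z = x.dot(beta)                     # linear predictor Xβ, i.e. the log-odds
p = 1 / (1 + np.exp(-z))            # P(Y=1|X) from the logistic formula

log_odds = np.log(p / (1 - p))      # logit of p, which should equal z again

print(z, p, log_odds)
```

Here the log-odds are negative, so the resulting probability is below 0.5.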

# The sigmoid function
The sigmoid function is a mathematical function that maps any input value to a value between 0 and 1. It is commonly used in logistic regression to transform the log-odds (logit) of the probability of the response variable taking the value 1 into a probability value between 0 and 1.

The formula for the sigmoid function is as follows:

σ(z) = 1 / (1 + e^(-z))

Where:

- σ(z) is the output of the sigmoid function.
- z is the input to the sigmoid function.

The sigmoid function is a type of logistic function, and it has an S-shaped curve. As the input value z increases, the output value of the sigmoid function approaches 1. As the input value z decreases, the output value of the sigmoid function approaches 0. At an input value of 0, the output value of the sigmoid function is 0.5.

In logistic regression, the input value z is the log-odds (logit) of the probability of the response variable taking the value 1 given the predictor variables. The sigmoid function is used to transform the log-odds to a probability value between 0 and 1.
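The behavior described above can be checked numerically. The following is a minimal sketch; the helper name `sigmoid` and the sample inputs are chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid σ(z) = 1 / (1 + e^(-z)), applied element-wise."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

inputs = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
values = sigmoid(inputs)

# Large negative inputs give values near 0, large positive inputs near 1,
# and an input of 0 gives exactly 0.5
print(values)
```

The curve is also symmetric around z = 0, in the sense that σ(-z) = 1 - σ(z).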

# Gradient descent
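Gradient descent is an optimization algorithm commonly used in machine learning to find the parameters that minimize a model's cost function. It starts from an initial guess and repeatedly moves the parameters a small step in the direction of the negative gradient of the cost. In logistic regression, gradient descent is used to find the weights and bias that minimize the cost function.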
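For reference, the cost function usually minimized in logistic regression is the binary cross-entropy (log loss); this notebook does not name its cost function, so that choice is an assumption here, although its gradients match the `dw` and `db` expressions used in the implementation further down. The symbols are also chosen for illustration: ŷ_i = σ(w·x_i + b) is the predicted probability for example i, m is the number of training examples, and α is the learning rate.

```latex
J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]

\frac{\partial J}{\partial w} = \frac{1}{m} X^{\top} (\hat{y} - y), \qquad
\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)

w \leftarrow w - \alpha \frac{\partial J}{\partial w}, \qquad
b \leftarrow b - \alpha \frac{\partial J}{\partial b}
```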

# Learning rate

The learning rate is a hyperparameter that sets the step size taken at each iteration while moving toward a minimum of the loss function. It needs to be chosen carefully: a value that is too small makes convergence slow, while a value that is too large can overshoot the minimum and prevent the model from converging at all.
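The effect of the learning rate is easiest to see on a toy problem. The sketch below is not part of the notebook's model: it minimizes the one-dimensional function f(w) = w², whose gradient is 2w, with three made-up learning rates.

```python
def gradient_descent(learning_rate, steps=50, w0=5.0):
    """Minimize f(w) = w**2 (gradient 2*w), starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return w

small = gradient_descent(0.001)  # too small: barely moves in 50 steps
good = gradient_descent(0.1)     # converges close to the minimum at w = 0
large = gradient_descent(1.1)    # too large: every step overshoots, so w diverges

print(small, good, large)
```

With the same number of steps, only the middle setting ends up near the minimum.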

Importing the dependencies

In [158]:
# importing the library
import numpy as np

In [159]:
class logistic_regression():
  # declaring the learning rate and number of iterations (hyperparameters)
  def __init__(self, learning_rate, no_of_iterations):
    self.learning_rate = learning_rate
    self.no_of_iterations = no_of_iterations

  # fit function to train the model on a dataset
  def fit(self, X, Y):
    # m = number of training examples (rows), n = number of features (columns)
    self.m, self.n = X.shape
    # initialize the weights and the bias to zero
    self.w = np.zeros(self.n)
    self.b = 0

    self.X = X
    self.Y = Y
    # implementing gradient descent
    for i in range(self.no_of_iterations):
      self.update_weights()

  def update_weights(self):
    # y_hat formula (sigmoid function)
    y_hat = 1/(1+np.exp(-(self.X.dot(self.w)+self.b)))
    # derivatives of the cost with respect to the weights and the bias
    dw = (1/self.m)*np.dot(self.X.T, (y_hat - self.Y))
    db = (1/self.m)*np.sum(y_hat - self.Y)
    # updating the weights and bias values using gradient descent
    self.w = self.w - self.learning_rate * dw
    self.b = self.b - self.learning_rate * db

  # sigmoid equation & decision boundary
  def predict(self, X):
    # use the X passed in, not the training data stored in self.X
    y_pred = 1/(1+np.exp(-(X.dot(self.w)+self.b)))
    y_pred = np.where(y_pred > 0.5, 1, 0)
    return y_pred
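One way to gain confidence in the `dw` and `db` formulas used in `update_weights` is a numerical gradient check. The sketch below assumes the cost being minimized is the binary cross-entropy (the notebook does not name it) and compares the analytical gradients with central finite differences on a small random dataset. The helper names and the data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(w, b, X, y):
    """Binary cross-entropy averaged over the m examples."""
    y_hat = sigmoid(X.dot(w) + b)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Small synthetic dataset (random values, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20)
w = rng.normal(size=3)
b = 0.1
m = X.shape[0]

# Analytical gradients, same formulas as dw and db in update_weights
y_hat = sigmoid(X.dot(w) + b)
dw = (1 / m) * X.T.dot(y_hat - y)
db = (1 / m) * np.sum(y_hat - y)

# Numerical gradients by central finite differences
eps = 1e-6
dw_num = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = eps
    dw_num[j] = (cost(w + step, b, X, y) - cost(w - step, b, X, y)) / (2 * eps)
db_num = (cost(w, b + eps, X, y) - cost(w, b - eps, X, y)) / (2 * eps)

# Both differences should be very close to zero
print(np.max(np.abs(dw - dw_num)), abs(db - db_num))
```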
    


In [160]:
import pandas as pd

Data collection and analysis

PIMA diabetes dataset

In [161]:
# loading the diabetes dataset into a pandas DataFrame
df=pd.read_csv("diabetes.csv")

In [162]:
#print top 5 rows
df.head()

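|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |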


In [163]:
#printing the bottom 5 rows
df.tail()

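|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.34 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |
| 767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 |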


importing the libs


In [164]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [165]:
# number of rows and columns 
df.shape

(768, 9)

In [166]:
# getting the statistical measures of the dataset
df.describe()

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [167]:
df["Outcome"].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

0 -> non-diabetic

1 -> diabetic
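Because the two classes are not balanced, it helps to know how well a trivial model would do before judging any accuracy score. The short sketch below uses the class counts from the `value_counts()` output to compute the accuracy of always predicting the majority class. The variable names are illustrative.

```python
# Class counts taken from the value_counts() output
n_non_diabetic = 500
n_diabetic = 268

total = n_non_diabetic + n_diabetic
# Accuracy of a trivial model that always predicts the majority class (0)
baseline_accuracy = max(n_non_diabetic, n_diabetic) / total

print(round(baseline_accuracy, 3))  # 0.651
```

Any trained model should be compared against this baseline of roughly 65%.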

In [168]:
df.groupby("Outcome").mean()

| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 3.298 | 109.98 | 68.184 | 19.664 | 68.792 | 30.3042 | 0.429734 | 31.19 |
| 1 | 4.865672 | 141.257463 | 70.824627 | 22.164179 | 100.335821 | 35.142537 | 0.5505 | 37.067164 |


In [169]:
# separating the features and labels
feature = df.drop(columns="Outcome")
target = df["Outcome"]

In [170]:
feature

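|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 |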


In [171]:
target


0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

# data standardization

In [172]:
scaler= StandardScaler()

In [173]:
scaler.fit(feature)

In [174]:
standardized_data = scaler.transform(feature)

In [175]:
print(standardized_data)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]
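To make the transformation concrete: `StandardScaler` standardizes each column by subtracting its mean and dividing by its standard deviation (the population standard deviation, i.e. `ddof=0`). The sketch below reproduces that computation with plain NumPy on a tiny made-up matrix, not on the diabetes data.

```python
import numpy as np

# Tiny made-up feature matrix (4 rows, 2 features), for illustration only
X = np.array([[1.0, 200.0],
              [2.0, 240.0],
              [3.0, 260.0],
              [4.0, 300.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)               # population standard deviation (ddof=0)
X_standardized = (X - mean) / std

# After standardization each column has mean 0 and standard deviation 1
print(X_standardized.mean(axis=0))
print(X_standardized.std(axis=0))
```

In practice the mean and standard deviation are often computed on the training split only and then reused for the test split, so that no information from the test data reaches the model.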


In [176]:
feature = standardized_data
target= df["Outcome"]

In [177]:
feature 

array([[ 0.63994726,  0.84832379,  0.14964075, ...,  0.20401277,
         0.46849198,  1.4259954 ],
       [-0.84488505, -1.12339636, -0.16054575, ..., -0.68442195,
        -0.36506078, -0.19067191],
       [ 1.23388019,  1.94372388, -0.26394125, ..., -1.10325546,
         0.60439732, -0.10558415],
       ...,
       [ 0.3429808 ,  0.00330087,  0.14964075, ..., -0.73518964,
        -0.68519336, -0.27575966],
       [-0.84488505,  0.1597866 , -0.47073225, ..., -0.24020459,
        -0.37110101,  1.17073215],
       [-0.84488505, -0.8730192 ,  0.04624525, ..., -0.20212881,
        -0.47378505, -0.87137393]])

In [178]:
x_train,x_test,y_train,y_test=train_test_split(feature,target,test_size=0.2,random_state=2)

In [179]:
print(feature.shape,x_train.shape,x_test.shape)

(768, 8) (614, 8) (154, 8)
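The split sizes follow from `test_size=0.2`: 20% of 768 rows is 153.6, which rounds up to 154 test rows and leaves 614 for training. The sketch below is a simplified NumPy stand-in for a shuffled split, written to show the idea. It is not scikit-learn's implementation, it does not select the same rows as `train_test_split`, and the function name is hypothetical.

```python
import numpy as np

def simple_train_test_split(X, y, test_size=0.2, seed=2):
    """Shuffle the row indices and hold out a fraction of them for testing."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(np.ceil(test_size * len(X)))   # round the test size up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Placeholder arrays with the same shapes as the standardized features and labels
X = np.zeros((768, 8))
y = np.zeros(768, dtype=int)

X_tr, X_te, y_tr, y_te = simple_train_test_split(X, y)
print(X_tr.shape, X_te.shape)  # (614, 8) (154, 8)
```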


# training the model

In [180]:
classifier=logistic_regression(learning_rate=0.01,no_of_iterations=1000)

In [181]:
classifier.fit(x_train,y_train)

# model evaluation 

In [182]:
# accuracy score on train data
x_train_prediction = classifier.predict(x_train)
training_data_accuracy = accuracy_score(y_train,x_train_prediction)

In [183]:
x_train_prediction.shape , y_train.shape , x_train.shape

((614,), (614,), (614, 8))

In [184]:
print("accuracy score of the training data : ", training_data_accuracy )

accuracy score of the training data :  0.7768729641693811


In [189]:
# accuracy score on test data
x_test_prediction = classifier.predict(x_test)
test_data_accuracy = accuracy_score(y_test,x_test_prediction)

In [190]:
x_test_prediction.shape , y_test.shape , x_test.shape

((154,), (154,), (154, 8))
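Accuracy alone can hide how a classifier treats each class, which matters for a dataset where one class is more common than the other. The following sketch computes the confusion-matrix counts, precision, and recall by hand. The labels and predictions are made up for illustration and are not results from this notebook.

```python
import numpy as np

# Made-up labels and predictions, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found

print(tp, tn, fp, fn)
print(accuracy, precision, recall)
```

In this toy case the accuracy is 0.7 even though only half of the positive cases are detected.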

# making a predictive system

In [193]:

input_data = (5,166,72,19,175,25.8,0.587,51)
# changing the input_data to a numpy array
input_data_as_numpy_array = np.asarray(input_data)

# reshape the array as we are predicting for one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1,-1)

# standardize the input data
std_data = scaler.transform(input_data_reshaped)
print(std_data)

prediction = classifier.predict(std_data)
print(prediction)

if (prediction[0] == 0):
    print("the person is not diabetic")
else:
    print("the person is diabetic")
    

[[ 0.3429808   1.41167241  0.14964075 -0.09637905  0.82661621 -0.78595734
   0.34768723  1.51108316]]

