**Logistic Regression**
- Used for supervised learning (training on labeled data is supervised learning; learning from unlabeled data, as in clustering, is unsupervised learning)

- Classification model (e.g., diabetes diagnosis, spam detection, deciding whether a user account is fake)

- Best suited to binary classification (multi-class problems are also possible, but they need an extension such as one-vs-rest or softmax regression)

- Uses the sigmoid function to turn a linear score into a probability


Sigmoid curve equation

The logistic regression model predicts the probability that a given input $\mathbf{x}$ belongs to a particular class (e.g., class 1). The formula for the logistic regression model is:

$$
P(y=1|\mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n)}}
$$

Where:  
- $P(y=1|\mathbf{x})$ is the probability that the dependent variable $y$ equals 1 given the input $\mathbf{x}$.  
- $\beta_0$ is the intercept.  
- $\beta_1, \beta_2, \ldots, \beta_n$ are the coefficients for the independent variables $x_1, x_2, \ldots, x_n$.  
- $e$ is the base of the natural logarithm.
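
To make the formula concrete, here is a minimal sketch that evaluates it for a single input. The coefficient and feature values are made up for illustration, and `sigmoid` is a helper name chosen for this example.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and one input with two features (illustrative values)
beta_0 = -1.0                  # intercept
beta = np.array([2.0, -0.5])   # beta_1, beta_2
x = np.array([1.0, 2.0])       # x_1, x_2

z = beta_0 + beta @ x          # -1 + 2*1 + (-0.5)*2 = 0
p = sigmoid(z)                 # sigmoid(0) = 0.5
print(p)                       # prints 0.5
```

A score of 0 sits exactly on the decision boundary, so the model assigns both classes a probability of 0.5.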


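In other words, the formula can be written as:
$$
P(y=1|\mathbf{x}) = \frac{1}{1 + e^{-z}}
$$

where $z$ is the linear combination of the inputs, with weights $w = (\beta_1, \ldots, \beta_n)$ and bias $b = \beta_0$:

$$
z = w \cdot \mathbf{x} + b
$$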

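Derivatives (gradients of the binary cross-entropy loss, averaged over the $m$ training samples, where $\hat{Y}_i$ is the predicted probability for sample $i$ and $X_i$ is its feature vector):

Derivative with respect to the weights $w$
$$
dw = \frac{1}{m} \sum_{i=1}^{m} ( \hat{Y}_i - Y_i ) X_i
$$

Derivative with respect to the bias $b$
$$
db = \frac{1}{m} \sum_{i=1}^{m} ( \hat{Y}_i - Y_i )
$$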
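
One way to gain confidence in the two gradient expressions above is to compare them with finite differences of the loss. The sketch below assumes the loss is the mean binary cross-entropy and uses random toy data. The helper names (`loss`, `gradients`) are chosen for this example only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, X, Y):
    """Mean binary cross-entropy for weights w and bias b."""
    Y_hat = sigmoid(X @ w + b)
    return -np.mean(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))

def gradients(w, b, X, Y):
    """Analytic gradients: the mean of (Y_hat - Y) * X and the mean of (Y_hat - Y)."""
    m = X.shape[0]
    error = sigmoid(X @ w + b) - Y
    dw = X.T @ error / m
    db = np.sum(error) / m
    return dw, db

# Random toy problem: 50 samples, 3 features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = (rng.random(50) > 0.5).astype(float)
w = rng.normal(size=3)
b = 0.1

dw, db = gradients(w, b, X, Y)

# Central finite differences, one coordinate at a time
eps = 1e-6
dw_numeric = np.zeros_like(w)
for j in range(w.size):
    step = np.zeros_like(w)
    step[j] = eps
    dw_numeric[j] = (loss(w + step, b, X, Y) - loss(w - step, b, X, Y)) / (2 * eps)
db_numeric = (loss(w, b + eps, X, Y) - loss(w, b - eps, X, Y)) / (2 * eps)

max_diff = max(np.max(np.abs(dw - dw_numeric)), abs(db - db_numeric))
print(max_diff)
```

If the analytic formulas are right, `max_diff` should be tiny compared with the gradients themselves.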

Gradient Descent:
$$
w = w - \alpha \cdot dw
$$

$$
b = b - \alpha \cdot db
$$
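
The update rule can be sketched on a tiny problem to show the loss going down. The data set, the learning rate of 0.1, and the 100 iterations are illustrative assumptions, not values taken from the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D data set: the label is 1 when the feature is positive
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
Y = np.array([0.0, 0.0, 1.0, 1.0])
m = X.shape[0]

w = np.zeros(X.shape[1])
b = 0.0
alpha = 0.1  # learning rate

def bce(w, b):
    """Mean binary cross-entropy on the toy data."""
    Y_hat = sigmoid(X @ w + b)
    return -np.mean(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))

initial_loss = bce(w, b)      # equals log(2), since every prediction starts at 0.5
for _ in range(100):
    error = sigmoid(X @ w + b) - Y
    dw = X.T @ error / m
    db = np.sum(error) / m
    w = w - alpha * dw        # w = w - alpha * dw
    b = b - alpha * db        # b = b - alpha * db
final_loss = bce(w, b)

print(initial_loss, final_loss)
```

With zero weights every prediction starts at 0.5, so the initial loss is $\ln 2$. Each update moves $w$ in the direction that lowers the loss.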

In [25]:
import numpy as np
import tensorflow as tf
import os

In [26]:
images_dir = "./cell_images"
parasitized_images = "./cell_images/Parasitized"
uninfected_data = "./cell_images/Uninfected"

In [27]:
def count_files_in_directories(root_dir):
  # Counts every file in each directory, including any non-image files
  for dirpath, dirnames, filenames in os.walk(root_dir):
    print(f"{dirpath}: {len(filenames)} files")

In [28]:
count_files_in_directories(parasitized_images)

./cell_images/Parasitized: 13780 files


In [29]:
count_files_in_directories(uninfected_data)

./cell_images/Uninfected: 13780 files


In [30]:
CONFIGURATION = {
    "BATCH_SIZE": 32,
    "IM_SIZE": 64,
    "IM_DIVISION": 255.0,
    "N_EPOCHS": 20,
    "LEARNING_RATE": 1e-4,
    "N_DENSE_1": 1024,
    "N_DENSE_2": 128,
    "CLASS_NAMES": ["parasitized","uninfected"],
    "SPLIT": 0.2,
    "CLASS_NUM": 2,
}

In [31]:
# Load the dataset
train_dataset = tf.keras.preprocessing.image_dataset_from_directory(
    images_dir,
    labels='inferred',
    label_mode='int',
    image_size=(CONFIGURATION["IM_SIZE"], CONFIGURATION["IM_SIZE"]),
    batch_size=CONFIGURATION["BATCH_SIZE"],
    validation_split=CONFIGURATION["SPLIT"],
    subset='training',
    seed=123
)

val_dataset = tf.keras.preprocessing.image_dataset_from_directory(
    images_dir,
    labels='inferred',
    label_mode='int',
    image_size=(CONFIGURATION["IM_SIZE"], CONFIGURATION["IM_SIZE"]),
    batch_size=CONFIGURATION["BATCH_SIZE"],
    validation_split=CONFIGURATION["SPLIT"],
    subset='validation',
    seed=123
)



Found 27558 files belonging to 2 classes.
Using 22047 files for training.
Found 27558 files belonging to 2 classes.
Using 5511 files for validation.


In [32]:
# Normalize the images
def normalize(image, label):
    # Scale pixel values from [0, 255] to [0, 1]
    image = tf.cast(image, tf.float32) / CONFIGURATION["IM_DIVISION"]
    return image, label

train_dataset = train_dataset.map(normalize)
val_dataset = val_dataset.map(normalize)

# Cache, shuffle (training only), and prefetch the datasets for better performance
train_dataset = train_dataset.cache().shuffle(1000).prefetch(buffer_size=tf.data.AUTOTUNE)
val_dataset = val_dataset.cache().prefetch(buffer_size=tf.data.AUTOTUNE)


In [33]:
# Convert the TensorFlow dataset to NumPy arrays
def dataset_to_numpy(dataset):
    images = []
    labels = []
    for image, label in dataset:
        images.append(image.numpy())
        labels.append(label.numpy())
    # Concatenate all batches into single NumPy arrays
    images = np.concatenate(images, axis=0)
    labels = np.concatenate(labels, axis=0)
    return images, labels

X_train, Y_train = dataset_to_numpy(train_dataset)
X_val, Y_val = dataset_to_numpy(val_dataset)

# Flatten the images
X_train = X_train.reshape(X_train.shape[0], -1)
X_val = X_val.reshape(X_val.shape[0], -1)
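
The flattening step turns each image into one row of features, which is the 2-D shape the logistic regression code expects. The sketch below uses a made-up batch of four images and assumes three color channels, which is the default `color_mode='rgb'` of `image_dataset_from_directory`.

```python
import numpy as np

# Hypothetical batch of 4 RGB images at the notebook's 64 x 64 resolution
batch = np.zeros((4, 64, 64, 3), dtype=np.float32)

# Keep the sample axis and flatten height, width, and channels into one axis
flat = batch.reshape(batch.shape[0], -1)
print(flat.shape)  # (4, 12288), since 64 * 64 * 3 = 12288
```

Each sample therefore has 12288 features, so the weight vector `w` has 12288 entries.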

In [34]:
class LogisticRegression():
    # learning_rate and n_iterations are the hyperparameters
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations

    # fit trains the model on the dataset
    # X is an (m, n) feature matrix, Y is a length-m vector of 0/1 labels
    def fit(self, X, Y):

        # number of data points in the dataset (number of rows) --> m
        # number of input features in the dataset (number of columns) --> n
        self.m, self.n = X.shape 

        # initiating weight and bias values
        self.w = np.zeros(self.n)
        self.b = 0

        self.X = X
        self.Y = Y

        # implement Gradient Descent for Optimization
        for i in range(self.n_iterations):
            self.__update_weights()

    def __update_weights(self):
        # Y_hat formula (sigmoid function)
        # X.dot(w) is a matrix-vector product that gives one score per sample
        # z = Xw + b
        z = self.X.dot(self.w) + self.b
        Y_hat = 1 / (1 + np.exp(-z))

        # derivatives (gradients)
        # m - number of samples (rows)
        # n - number of features (columns)
        # X has shape (m, n); w has shape (n,); Y and Y_hat have shape (m,)
        # X.T has shape (n, m), so X.T.dot(Y_hat - Y) has shape (n,), matching w
        dw = (1/self.m) * np.dot(self.X.T, (Y_hat - self.Y))
        # np.sum adds the per-sample errors into a single scalar
        db = (1/self.m) * np.sum(Y_hat - self.Y)

        # Gradient Descent
        # updating weights and bias using the Gradient Descent equations
        # w = w - a*dw
        # b = b - a*db
        self.w = self.w - self.learning_rate * dw
        self.b = self.b - self.learning_rate * db


    # Sigmoid Equation & Decision Boundary
    def predict(self, X):
        # Sigmoid Equation
        z = X.dot(self.w) + self.b
        Y_pred = 1 / (1 + np.exp(-z))
        Y_pred = np.where( Y_pred > 0.5, 1, 0)
        return Y_pred
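
The expression `1 / (1 + np.exp(-z))` used in the class can raise an overflow `RuntimeWarning` when `z` is a large negative number, which is possible with thousands of input features. A minimal sketch of a numerically stable alternative follows. `stable_sigmoid` is a hypothetical helper and is not part of the class above. `scipy.special.expit` offers the same function if SciPy is available.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that never exponentiates a large positive number."""
    z = np.asarray(z, dtype=np.float64)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) is at most 1, so it cannot overflow
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, rewrite as exp(z) / (1 + exp(z)); exp(z) is below 1
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

z = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
print(stable_sigmoid(z))
```

Both branches compute the same mathematical function. The split only decides which algebraic form is safe to evaluate for a given sign of `z`.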

In [35]:
model = LogisticRegression()

In [38]:
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Num GPUs Available:  1


In [39]:
# TODO: work out how best to use the custom LogisticRegression on the malaria dataset
model.fit(X_train,Y_train)

In [40]:
Y_pred = model.predict(X_val)
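
Since `predict` returns hard 0/1 labels, a simple way to judge them is accuracy, the fraction of samples whose predicted label matches the true one. The sketch below uses made-up label arrays. `accuracy` is a hypothetical helper, and no result for the malaria data is implied.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Illustrative labels: 4 of the 5 predictions are correct
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])
print(accuracy(y_true, y_pred))  # prints 0.8
```

In the notebook the same helper could be called as `accuracy(Y_val, Y_pred)`.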

In [62]:
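def binary_cross_entropy_loss(Y_val, Y_pred, epsilon=1e-15):
    # Y_pred should hold probabilities; clip them so log(0) never occurs.
    # The labels Y_val are left untouched. Overwriting Y_val with the clipped
    # Y_pred would compare the predictions with themselves and return a value
    # near zero (about 3.55e-14 for hard 0/1 predictions) whatever the accuracy.
    Y_pred = np.clip(Y_pred, epsilon, 1 - epsilon)

    loss = - np.mean(Y_val * np.log(Y_pred) + (1 - Y_val) * np.log(1 - Y_pred))
    return loss


In [63]:
# model.predict returns hard 0/1 labels, so each misclassified sample
# contributes -log(1e-15), roughly 34.5, to the mean
loss = binary_cross_entropy_loss(Y_val=Y_val, Y_pred=Y_pred)

In [64]:
print(loss)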
