When fitting a logistic regression model, we aim to find the parameter values $\beta$ that maximize the log-likelihood function (or equivalently, minimize the negative log-likelihood). 

The optimization problem can be formulated as:

$$\hat{\beta} = \arg\max_{\beta} \ell(\beta)$$

where $\ell(\beta)$ is the log-likelihood of the $n$ observations $(X_i, Y_i)$ with binary labels $Y_i \in \{0, 1\}$:

$$\ell(\beta) = \sum_{i=1}^{n} \left[Y_i \beta^T X_i - \log(1 + e^{\beta^T X_i})\right]$$

Equivalently, we can minimize the negative log-likelihood:

$$\hat{\beta} = \arg\min_{\beta} -\ell(\beta)$$

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \left[-Y_i \beta^T X_i + \log(1 + e^{\beta^T X_i})\right]$$

This is often referred to as minimizing the cross-entropy loss or log loss. In practice, we may also add a regularization term to prevent overfitting:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \left[-Y_i \beta^T X_i + \log(1 + e^{\beta^T X_i})\right] + \lambda \|\beta\|^2$$

Here $\lambda \|\beta\|^2$ is the L2 regularization term (also known as ridge regularization), and $\lambda \geq 0$ is a hyperparameter that controls the strength of regularization.
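
As a rough illustration of the objective above, the following sketch evaluates the negative log-likelihood with an optional L2 penalty using only the Python standard library. The function names (`log1p_exp`, `neg_log_likelihood`), the toy data, and the choice to penalize every coefficient exactly as the formula is written are assumptions made for this example, not part of the original text.

```python
import math

def log1p_exp(z):
    """Evaluate log(1 + e^z) without overflowing for large positive z."""
    # For z > 0, use the identity log(1 + e^z) = z + log(1 + e^-z).
    if z > 0:
        return z + math.log1p(math.exp(-z))
    return math.log1p(math.exp(z))

def neg_log_likelihood(beta, X, Y, lam=0.0):
    """Sum of -Y_i * beta^T X_i + log(1 + e^{beta^T X_i}), plus lam * ||beta||^2."""
    total = 0.0
    for x_i, y_i in zip(X, Y):
        z = sum(b * x for b, x in zip(beta, x_i))  # linear predictor beta^T X_i
        total += -y_i * z + log1p_exp(z)
    penalty = lam * sum(b * b for b in beta)  # L2 term over all coefficients
    return total + penalty

# Toy data, made up for illustration; the first column is a constant 1 for the intercept.
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.1]]
Y = [1, 0, 1, 0]

# At beta = 0 every predicted probability is 0.5, so the loss is 4 * log(2), about 2.7726.
print(neg_log_likelihood([0.0, 0.0], X, Y))

# With lam > 0 the same beta costs more by exactly lam * ||beta||^2.
print(neg_log_likelihood([0.5, 1.0], X, Y, lam=0.1))
```

Larger values of `lam` raise the cost of large coefficients, which pulls the minimizer toward zero.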

Since this optimization problem doesn't have a closed-form solution, we typically use iterative optimization algorithms such as Gradient Descent, Newton's Method, or variants like Stochastic Gradient Descent to find the optimal parameters.

For a single observation, the binary cross-entropy loss (the negative log-likelihood of that observation) is:

$$\mathcal{L}(\hat{y}, y) = -[y \log(\hat{y}) + (1-y) \log(1-\hat{y})]$$

Where:
- $y$ is the true binary label (0 or 1)
- $\hat{y}$ is the predicted probability that $y = 1$
- $\log$ is the natural logarithm

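For a dataset with $n$ samples, the average loss is:

$$J(\beta) = -\frac{1}{n}\sum_{i=1}^{n}[y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)]$$

With $\hat{y}_i = 1/(1 + e^{-\beta^T X_i})$, this average equals $-\ell(\beta)/n$, so minimizing the cross-entropy and maximizing the log-likelihood yield the same $\hat{\beta}$.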
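
The averaged loss above can be checked numerically against the log-likelihood form used earlier. The sketch below is only an illustration: the scores and labels are invented, and the clipping constant `eps` is an arbitrary safeguard against `log(0)` rather than anything prescribed by the text.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z}), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average of -[y log(y_hat) + (1 - y) log(1 - y_hat)] over the samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log never receives 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical linear scores beta^T X_i and their labels, for illustration only.
scores = [0.8, -1.5, 2.3, -0.2]
labels = [1, 0, 1, 1]

probs = [sigmoid(z) for z in scores]
bce = binary_cross_entropy(labels, probs)

# Same quantity from the log-likelihood form: -y * z + log(1 + e^z), averaged over samples.
nll = sum(-y * z + math.log1p(math.exp(z)) for y, z in zip(labels, scores)) / len(labels)

print(bce, nll)  # the two values agree up to floating-point error
```

The agreement follows from $\log \hat{y} = z - \log(1 + e^{z})$ and $\log(1 - \hat{y}) = -\log(1 + e^{z})$ when $\hat{y}$ is the sigmoid of $z$.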

Several optimization procedures can be used to solve the logistic regression optimization problem:

1. **Gradient Descent**: An iterative algorithm that updates parameters in the direction of the negative gradient of the cost function.
   
   $\beta^{(t+1)} = \beta^{(t)} - \alpha \nabla_\beta J(\beta^{(t)})$
   
   where $\alpha > 0$ is the learning rate and $\nabla_\beta J(\beta)$ is the gradient with respect to $\beta$ of the cost function $J$ (for example, the negative log-likelihood $-\ell(\beta)$).

2. **Stochastic Gradient Descent (SGD)**: A variant of gradient descent that computes the gradient from a single randomly selected training example at each iteration, which makes each update much cheaper on large datasets at the cost of noisier steps.

3. **Mini-batch Gradient Descent**: A compromise between batch gradient descent and SGD that uses a small random subset of training examples to compute the gradient at each iteration.

4. **Newton's Method**: A second-order optimization method that uses both the gradient and the Hessian matrix of the cost function.
   
   $\beta^{(t+1)} = \beta^{(t)} - [H_\beta J(\beta^{(t)})]^{-1} \nabla_\beta J(\beta^{(t)})$
   
   where $H_\beta J(\beta)$ is the Hessian matrix of the cost function.

5. **Quasi-Newton Methods**: Methods like BFGS and L-BFGS that approximate the Hessian matrix to avoid the computational cost of calculating and inverting it directly.

6. **Coordinate Descent**: Optimizes one parameter at a time while holding others constant, cycling through all parameters until convergence.

7. **Conjugate Gradient**: An algorithm that generates a sequence of search directions that are conjugate with respect to the Hessian matrix.

Each method has its trade-offs in terms of computational efficiency, memory requirements, and convergence properties. The choice depends on factors such as dataset size, feature dimensionality, and available computational resources.
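
To make two of the update rules above concrete, here is a minimal sketch of batch gradient descent and Newton's method for a model with an intercept and one feature. It assumes the cost $J$ is the average negative log-likelihood; the toy data, the learning rate, and the iteration counts are arbitrary choices for illustration, and the 2x2 linear solve is written out by hand to keep the example dependency-free.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def gradient(beta, xs, ys):
    """Gradient of the average negative log-likelihood for beta = (b0, b1)."""
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        r = sigmoid(beta[0] + beta[1] * x) - y  # predicted probability minus label
        g0 += r
        g1 += r * x
    n = len(xs)
    return [g0 / n, g1 / n]

def hessian(beta, xs):
    """Entries (h00, h01, h11) of the symmetric 2x2 Hessian of the same cost."""
    h00 = h01 = h11 = 0.0
    for x in xs:
        p = sigmoid(beta[0] + beta[1] * x)
        w = p * (1.0 - p)
        h00 += w
        h01 += w * x
        h11 += w * x * x
    n = len(xs)
    return h00 / n, h01 / n, h11 / n

def gradient_descent(xs, ys, alpha=0.5, steps=2000):
    beta = [0.0, 0.0]
    for _ in range(steps):
        g = gradient(beta, xs, ys)
        beta = [beta[0] - alpha * g[0], beta[1] - alpha * g[1]]
    return beta

def newton(xs, ys, steps=25):
    beta = [0.0, 0.0]
    for _ in range(steps):
        g0, g1 = gradient(beta, xs, ys)
        h00, h01, h11 = hessian(beta, xs)
        det = h00 * h11 - h01 * h01
        # Solve H d = g in closed form for the 2x2 case, then step by -d.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        beta = [beta[0] - d0, beta[1] - d1]
    return beta

# Toy data, made up for illustration; the classes overlap, so a finite optimum exists.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 0, 1, 1]

beta_gd = gradient_descent(xs, ys)
beta_newton = newton(xs, ys)
print(beta_gd)
print(beta_newton)
```

On a small problem like this, Newton's method typically needs far fewer iterations than gradient descent, but each iteration requires forming and solving with the Hessian, which is the trade-off the list above describes.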

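The parameters $\beta = (\beta_0, \beta_1, \ldots, \beta_p)$ define the relationship between the input features $X = (X_1, X_2, \ldots, X_p)$ and the log-odds of the positive class:

$\log\left(\frac{P(Y=1|X)}{P(Y=0|X)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p = \beta^T X$

In the compact form $\beta^T X$, the feature vector is understood to include a constant first entry $X_0 = 1$ that multiplies the intercept $\beta_0$.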

The coefficients have the following interpretation:
- $\beta_0$ represents the log-odds of the positive class when all features are zero
- Each $\beta_j$ for $j \geq 1$ represents the change in log-odds when the corresponding feature $X_j$ increases by one unit, holding all other features constant
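
The interpretation above can be illustrated with a small numerical example. The coefficients below are hypothetical values chosen for the sketch, not estimates from any dataset. The code checks that the log-odds at zero equal $\beta_0$ and that a one-unit increase in the feature multiplies the odds by $e^{\beta_1}$.

```python
import math

# Hypothetical coefficients (beta_0, beta_1) for a model with a single feature X_1.
beta_0, beta_1 = -1.0, 0.7

def log_odds(x1):
    """Linear predictor beta_0 + beta_1 * x1, i.e. the log-odds of Y = 1."""
    return beta_0 + beta_1 * x1

def probability(x1):
    """P(Y = 1 | X_1 = x1), obtained by inverting the log-odds."""
    return 1.0 / (1.0 + math.exp(-log_odds(x1)))

def odds(x1):
    """P(Y = 1 | x1) / P(Y = 0 | x1)."""
    p = probability(x1)
    return p / (1.0 - p)

# With the feature at zero, the log-odds reduce to the intercept beta_0.
baseline_log_odds = math.log(odds(0.0))

# Raising x1 by one unit adds beta_1 to the log-odds,
# which multiplies the odds by exp(beta_1) regardless of the starting value.
odds_ratio = odds(3.0) / odds(2.0)

print(baseline_log_odds, beta_0)
print(odds_ratio, math.exp(beta_1))
```

Note that the effect is constant on the log-odds scale only; the corresponding change in probability depends on where on the sigmoid curve the observation sits.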

Let's start with the log-likelihood function for logistic regression:

$\ell(\beta) = \sum_{i=1}^{n} \left[Y_i \beta^T X_i - \log(1 + e^{\beta^T X_i})\right]$

To find the gradient, we need to compute the partial derivatives with respect to each parameter $\beta_j$:

$\frac{\partial \ell(\beta)}{\partial \beta_j} = \frac{\partial}{\partial \beta_j} \sum_{i=1}^{n} \left[Y_i \beta^T X_i - \log(1 + e^{\beta^T X_i})\right]$

$= \sum_{i=1}^{n} \frac{\partial}{\partial \beta_j} \left[Y_i \beta^T X_i - \log(1 + e^{\beta^T X_i})\right]$

Let's compute each term separately:

1. First term: $\frac{\partial}{\partial \beta_j} (Y_i \beta^T X_i) = \frac{\partial}{\partial \beta_j} (Y_i \sum_{k=0}^{p} \beta_k X_{ik}) = Y_i X_{ij}$, where $X_{i0} = 1$ so that $\beta_0$ acts as the intercept.

2. Second term: $\frac{\partial}{\partial \beta_j} \log(1 + e^{\beta^T X_i})$

Using the chain rule:
$\frac{\partial}{\partial \beta_j} \log(1 + e^{\beta^T X_i}) = \frac{1}{1 + e^{\beta^T X_i}} \cdot \frac{\partial}{\partial \beta_j}(1 + e^{\beta^T X_i}) = \frac{1}{1 + e^{\beta^T X_i}} \cdot e^{\beta^T X_i} \cdot X_{ij}$

$= \frac{e^{\beta^T X_i}}{1 + e^{\beta^T X_i}} \cdot X_{ij}$

Dividing the numerator and denominator by $e^{\beta^T X_i}$ shows that $\frac{e^{\beta^T X_i}}{1 + e^{\beta^T X_i}} = \frac{1}{1 + e^{-\beta^T X_i}} = p(X_i)$, which is the predicted probability that $Y_i = 1$.

Combining the terms:

$\frac{\partial \ell(\beta)}{\partial \beta_j} = \sum_{i=1}^{n} \left[Y_i X_{ij} - \frac{e^{\beta^T X_i}}{1 + e^{\beta^T X_i}} X_{ij}\right]$

$= \sum_{i=1}^{n} \left[Y_i - p(X_i)\right] X_{ij}$

Stacking these partial derivatives for $j = 0, \ldots, p$, the gradient of the log-likelihood with respect to the parameter vector $\beta$ is:

$\nabla_\beta \ell(\beta) = \sum_{i=1}^{n} \left[Y_i - p(X_i)\right] X_i$

In matrix form, this can be written as:

$\nabla_\beta \ell(\beta) = X^T (Y - \hat{p})$

where:
- $X$ is the $n \times (p+1)$ design matrix with rows $X_i^T$
- $Y$ is the $n \times 1$ vector of observed outcomes
- $\hat{p}$ is the $n \times 1$ vector of predicted probabilities $p(X_i)$, written with a hat to keep it distinct from the number of features $p$

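This gradient is used in gradient-based optimization methods to find the parameters that maximize the log-likelihood. For the equivalent problem of minimizing $-\ell(\beta)$, the gradient is simply $-\nabla_\beta \ell(\beta)$.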
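
A common way to sanity-check a derivation like the one above is to compare the analytic gradient with a central finite-difference approximation of the log-likelihood. The sketch below does this with the standard library only. The data, the test point `beta`, and the step size `h` are arbitrary assumptions for illustration.

```python
import math

def sigmoid(z):
    """Predicted probability e^z / (1 + e^z), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def log_likelihood(beta, X, Y):
    """Sum over samples of Y_i * beta^T X_i - log(1 + e^{beta^T X_i})."""
    return sum(y * dot(beta, x) - math.log1p(math.exp(dot(beta, x)))
               for x, y in zip(X, Y))

def gradient(beta, X, Y):
    """Analytic gradient: X^T times (observed outcomes minus predicted probabilities)."""
    residuals = [y - sigmoid(dot(beta, x)) for x, y in zip(X, Y)]
    return [sum(r * x[j] for r, x in zip(residuals, X)) for j in range(len(beta))]

def finite_difference_gradient(beta, X, Y, h=1e-6):
    """Central-difference approximation of each partial derivative."""
    grads = []
    for j in range(len(beta)):
        up, down = list(beta), list(beta)
        up[j] += h
        down[j] -= h
        grads.append((log_likelihood(up, X, Y) - log_likelihood(down, X, Y)) / (2 * h))
    return grads

# Toy design matrix (first column is the constant 1) and labels, made up for illustration.
X = [[1.0, 0.5, -1.0], [1.0, -1.2, 0.3], [1.0, 2.0, 1.5], [1.0, 0.1, -0.4]]
Y = [1, 0, 1, 0]
beta = [0.2, -0.5, 1.0]

analytic = gradient(beta, X, Y)
numeric = finite_difference_gradient(beta, X, Y)
max_gap = max(abs(a - b) for a, b in zip(analytic, numeric))

print(analytic)
print(numeric)
print(max_gap)  # should be tiny if the derivation and the code agree
```

A large gap here would point to a sign error or an indexing mistake in either the derivation or the implementation.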