## Quick review from last lecture

* We saw examples of continuous and discrete functions
* We took their derivatives
* We plotted these
* We used an example from Wikipedia to study wheat yields with a logistic (or sigmoid) function model
* We ended with a prescriptivist look at logistic regression from a "canonical reference" according to GPT-4, https://www.utstat.toronto.edu/~brunner/oldclass/2201s11/readings/glmbook.pdf 


## Today's lecture 

* Taking a prescriptivist stance means deliberately limiting our autonomy: we restrict our thinking, and the symbols we use, to a fixed vocabulary so that concepts can be communicated clearly and precisely, which aids learning.
* The risk is that we overlook the controversies involved: https://www.jstor.org/stable/3541653 (from Ismael), https://www.colorado.edu/amath/sites/default/files/attached-files/vallverdu08.pdf

* Logistic regression recap
* Embeddings from the perspective of logistic regression 
* Embeddings from the perspective of linear algebra

## Background: what does "expected value" mean in the McCullagh and Nelder book, and how does it relate to logistic regression?

* The expected value of a random variable is the sum, over all possible outcomes of the experiment, of each outcome's payoff ("value") multiplied by its probability.
* For example, if we roll a fair six-sided die, the expected value of the die roll is 3.5, because each of the faces 1 through 6 comes up with probability 1/6 and their average is 3.5.
* The expected value is also known as the expectation, mathematical expectation, EV, average, mean value, mean, or first moment.

For a discrete random variable with $n$ possible outcomes, the formula for the expected value is:

$$\mathbb{E}(X) = \sum_{i=1}^n x_i p_i$$

where $X$ is the random variable, $x_i$ is the value of $X$ for the $i$-th outcome, and $p_i$ is the probability of that outcome.

* The expected value of a random variable is denoted as $\mathbb{E}(X)$ or $\mu_X$.

As an example, this formula can be used to calculate the expected value of rolling a fair six-sided die:

$$\mathbb{E}(X) = \sum_{i=1}^6 x_i p_i = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} = 3.5$$

The expected value for an unfair (loaded) die, where a six comes up with probability 3/8 and every other face with probability 1/8, can be calculated similarly:

$$\mathbb{E}(X) = \sum_{i=1}^6 x_i p_i = 1 \cdot \frac{1}{8} + 2 \cdot \frac{1}{8} + 3 \cdot \frac{1}{8} + 4 \cdot \frac{1}{8} + 5 \cdot \frac{1}{8} + 6 \cdot \frac{3}{8} = \frac{33}{8} = 4.125$$

As another example, the expected value of the pocket number on a roulette wheel with 38 equally likely pockets (labeled 0 through 37 here) can be calculated as follows:

$$\mathbb{E}(X) = \sum_{i=1}^n x_i p_i = 0 \cdot \frac{1}{38} + 1 \cdot \frac{1}{38} + 2 \cdot \frac{1}{38} + \cdots + 36 \cdot \frac{1}{38} + 37 \cdot \frac{1}{38} = ?$$

* The expected value of a random variable is a key aspect of the probability distribution of the variable.
* It is a measure of the "center" of the distribution.
* The expected value is the average we would "expect" to get if we repeated the experiment many times; it need not be a value that can actually occur (no die face shows 3.5).
* More formally, the law of large numbers states that the arithmetic mean of the values almost surely converges to the expected value as the number of repetitions approaches infinity.
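
The definition and the law of large numbers can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper name `expected_value`, the seed, and the number of simulated rolls are illustrative choices, not part of the lecture.

```python
import random

def expected_value(values, probs):
    """Sum of value * probability over all outcomes."""
    assert abs(sum(probs) - 1.0) < 1e-12, "probabilities must sum to 1"
    return sum(x * p for x, p in zip(values, probs))

faces = [1, 2, 3, 4, 5, 6]
fair = [1 / 6] * 6
unfair = [1 / 8] * 5 + [3 / 8]  # the loaded die: a six has probability 3/8

ev_fair = expected_value(faces, fair)
ev_unfair = expected_value(faces, unfair)
print("fair die:", ev_fair)
print("unfair die:", ev_unfair)

# Law of large numbers: the mean of many simulated rolls approaches the expected value.
rng = random.Random(0)  # fixed seed, an arbitrary choice for reproducibility
rolls = rng.choices(faces, weights=unfair, k=100_000)
sample_mean = sum(rolls) / len(rolls)
print("mean of simulated unfair rolls:", sample_mean)
```

Increasing the number of simulated rolls should bring the sample mean closer to the value computed from the formula.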

## Background: probability mass functions

The probability mass function (PMF) of a discrete random variable $X$ is a function that gives the probability that $X$ is exactly equal to some value. The probability mass function is often the primary means of defining a discrete probability distribution, and such functions exist for either scalar or multivariate random variables whose domain is discrete.

## Background: probability density functions

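The probability density function (PDF) of a continuous random variable $X$ is the derivative of the cumulative distribution function (CDF) of $X$, wherever that derivative exists. The probability that $X$ falls in an interval is the integral of the PDF over that interval.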

## Background: cumulative distribution functions

The cumulative distribution function (CDF) of a random variable $X$ is the function $F(x) = \Pr(X \le x)$: the probability that $X$ will take a value less than or equal to $x$.
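
For a discrete random variable, the CDF is a running total of the PMF. A small sketch for the fair die from the expected-value section, using only the standard library (the variable names are illustrative):

```python
from itertools import accumulate

faces = [1, 2, 3, 4, 5, 6]
pmf = [1 / 6] * 6             # Pr(X = x) for each face of a fair die
cdf = list(accumulate(pmf))   # Pr(X <= x): running total of the PMF

for x, p, c in zip(faces, pmf, cdf):
    print(f"x={x}  Pr(X=x)={p:.4f}  Pr(X<=x)={c:.4f}")
```

The last CDF value is 1 (up to floating-point rounding) because some face must come up.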

## Background: domain

The domain of a function is the set of all possible inputs for the function. For example, the domain of $f(x) = x^2$ is all real numbers, and the domain of the square root function is all non-negative real numbers.



## Example of a probability mass function for a discrete random variable

An example of a probability mass function is the binomial distribution, which gives the probability of obtaining exactly $k$ successes in $n$ independent Bernoulli trials with probability $p$ of success on each trial. The binomial distribution's PMF is:

$$f(k;n,p) = \Pr(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$

for $k = 0, 1, 2, ..., n$, where $\binom{n}{k}$ is the binomial coefficient, hence the name of the distribution.

We can plot the binomial distribution's PMF for different values of $n$ and $p$ using altair:

```python
import math

import altair as alt
import pandas as pd

def binomial_pmf(k, n, p):
    # math.comb(n, k) is the binomial coefficient; it replaces the factorial
    # ratio built on np.math, which newer NumPy releases no longer provide.
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# The PMF is only defined for 0 <= k <= n, so build k separately for each n.
rows = [
    {'k': k, 'n': n, 'p': p, 'pmf': binomial_pmf(k, n, p)}
    for n in range(1, 11)
    for p in [round(0.1 * i, 1) for i in range(1, 11)]
    for k in range(n + 1)
]
df = pd.DataFrame(rows)

# One facet per value of p, one line per value of n.
alt.Chart(df).mark_line().encode(
    x='k:Q',
    y='pmf:Q',
    color='n:N',
    column='p:N'
)
```

## Example of a probability density function for a continuous random variable

An example of a probability density function is the normal distribution, which gives the probability density of observing a value $x$ from a normal distribution with mean $\mu$ and standard deviation $\sigma$. The normal distribution's PDF is:

$$f(x;\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

for $-\infty < x < \infty$, where $\mu$ is the mean and $\sigma > 0$ is the standard deviation. The factor $\frac{1}{\sqrt{2\pi\sigma^2}}$ normalizes the density so that it integrates to 1.

We can plot the normal distribution's PDF for chosen values of $\mu$ and $\sigma$ (here $\mu = 0$ and $\sigma = 1$) using altair:

```python
import altair as alt
import pandas as pd
import numpy as np

def normal_pdf(x, mu, sigma):
    return 1 / np.sqrt(2 * np.pi * sigma**2) * np.exp(-(x - mu)**2 / (2 * sigma**2))

x = np.linspace(-5, 5, 1000)
df = pd.DataFrame({'x': x, 'y': normal_pdf(x, 0, 1)})
alt.Chart(df).mark_line().encode(x='x', y='y')
```
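
The statement that the PDF is the derivative of the CDF can also be checked numerically. The sketch below uses only the standard library; the closed form of the normal CDF in terms of `math.erf` is a standard identity, and the step size `h` and the test points are arbitrary choices.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

def normal_cdf(x, mu=0.0, sigma=1.0):
    # Normal CDF written with the error function
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

h = 1e-5  # step for the central finite difference
for x in (-1.0, 0.0, 0.5, 2.0):
    slope = (normal_cdf(x + h) - normal_cdf(x - h)) / (2 * h)
    print(f"x={x:+.1f}  dCDF/dx={slope:.6f}  PDF={normal_pdf(x):.6f}")
```

The two printed columns should agree to several decimal places at every test point.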



In [10]:
## Copilot -- FAIL
import altair as alt
import pandas as pd
import numpy as np

def binomial_pmf(k: int, n: int, p: float) -> float:
    k = int(k)
    n = int(n)
    p = float(p)
    return np.math.factorial(n) / (np.math.factorial(k) * np.math.factorial(n - k)) * p**k * (1 - p)**(n - k)

k = np.arange(0, 21, dtype=int)
n = np.arange(1, 11, dtype=int)
p = np.arange(0.1, 1.1, 0.1)

df = pd.DataFrame({
    'k': np.repeat(k, len(n) * len(p)),
    'n': np.tile(np.repeat(n, len(p)), len(k)),
    'p': np.tile(p, len(k) * len(n))
})

df['pmf'] = df.apply(lambda row: binomial_pmf(row['k'], row['n'], row['p']), axis=1)

alt.Chart(df).mark_line().encode(
    x='k',
    y='pmf',
    color='n',
    column='p'
)

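ValueError: factorial() not defined for negative values

The error arises because `k` runs from 0 to 20 while `n` is at most 10, so `n - k` is negative for many rows and the factorial is undefined there.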

In [9]:

def binomial_pmf(k: int, n: int, p: float) -> float:
    return np.math.factorial(n) / (np.math.factorial(k) * np.math.factorial(n - k)) * p**k * (1 - p)**(n - k)

binomial_pmf(3, 5, 0.5)

0.3125

In [11]:
## Claude - FAIL
import altair as alt
import numpy as np 

n = 10  # number of trials
p = 0.5 # probability of success

x = np.arange(n+1)  # x-axis values, 0 to n
y = []              # y-axis values (probabilities)

# calculate binomial PMF 
for i in x: 
    y.append((factorial(n) / (factorial(i) * factorial(n-i))) * (p**i) * ((1-p)**(n-i)))

# create Altair chart 
chart = alt.Chart(pd.DataFrame({'x': x, 'y': y})).mark_bar().encode(
    x='x', 
    y='y',
    tooltip=['x', 'y']  # show tooltip with x and y values 
)

# display the chart 
chart.show()

NameError: name 'factorial' is not defined

The cell calls `factorial` without ever importing it (for example from the `math` module).

In [14]:
## GPT-4
import math

import numpy as np
import pandas as pd
import altair as alt

n = 10
p = 0.5

# Generate the possible outcomes (from 0 to n)
x = np.arange(n+1)

# Calculate the PMF for each outcome
# (math.comb replaces np.math.comb, which newer NumPy releases no longer provide)
pmf = np.array([math.comb(n, int(k)) * (p**k) * ((1-p)**(n-k)) for k in x])

# Create a DataFrame
df = pd.DataFrame({
    'Outcome - k in the formula': x,
    'PMF': pmf
})

# Create the Altair chart
chart = alt.Chart(df).mark_bar().encode(
    x='Outcome - k in the formula:O',
    y='PMF:Q'
)

chart


In [6]:
import altair as alt
import pandas as pd
import numpy as np

def normal_pdf(x, mu, sigma):
    return 1 / np.sqrt(2 * np.pi * sigma**2) * np.exp(-(x - mu)**2 / (2 * sigma**2))

x = np.linspace(-5, 5, 1000)
df = pd.DataFrame({'x': x, 'y': normal_pdf(x, 0.5, 1)})
alt.Chart(df).mark_line().encode(x='x', y='y')

## Visualizing logistic regression from a generalized linear model perspective

> please create visualizations in altair for a jupyter notebook in the python programming language for the following key concepts in generalized linear models for logistic regression specifically

> `What is the canonical reference for generalized linear model (GLM)`

https://www.utstat.toronto.edu/~brunner/oldclass/2201s11/readings/glmbook.pdf 


In the McCullagh and Nelder formulation, the generalized linear model is defined in terms of a systematic component and a random component. Logistic regression is a type of generalized linear model (GLM) where the outcome is binary. 

The systematic component for a binary logistic regression model can be written as:

$$
\eta = X\beta
$$

where X is the design matrix, β is the vector of parameters to be estimated, and η is the linear predictor.

The logistic regression model specifically uses the logit link function, which connects the systematic and random components. This function is the natural logarithm of the odds, and can be written as:

$$ 
g(\mu) = \log\left(\frac{\mu}{1 - \mu}\right) = \eta
$$

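where μ represents the expected value of the response variable. For a binary response $Y \in \{0, 1\}$, the expected-value formula from the background section gives $\mathbb{E}(Y) = 1 \cdot \Pr(Y = 1) + 0 \cdot \Pr(Y = 0) = \Pr(Y = 1)$, so μ is the probability of the positive class.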

Therefore, the logistic regression model can be written as:

$$
\log\left(\frac{\mu}{1 - \mu}\right) = X\beta
$$

where:
- μ is the expected value of the response variable, given the predictors
- X is the matrix of predictors
- β is the vector of parameters to be estimated

Finally, solving the equation for μ gives the standard form of the logistic regression model:

$$
\mu = \frac{1}{1 + \exp(-X\beta)}
$$

This gives the probability of the positive class in a binary logistic regression model.

The linear predictor, the logistic function, and the logistic regression model can be represented as follows in LaTeX:

**1. The Linear Predictor:**

This is simply a linear combination of the predictors.

$$
\eta = \beta_0 + \beta_1 x
$$

Where:
- $\eta$ is the linear predictor
- $\beta_0$ is the intercept
- $\beta_1$ is the coefficient for the predictor $x$

**2. The Logistic Function:**

This maps the linear predictor onto the interval (0, 1) as a probability. 

$$
\mu = \frac{1}{1 + e^{-\eta}}
$$

Where:
- $\mu$ is the probability of the positive class
- $\eta$ is the linear predictor

**3. The Logistic Regression Model:**

The logistic regression model combines these two components.

$$
\mu = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}
$$

Where:
- $\mu$ is the probability of the positive class
- $\beta_0$ is the intercept
- $\beta_1$ is the coefficient for the predictor $x$

In [16]:
import numpy as np
import pandas as pd
import altair as alt

# Define the parameters for our model: beta[0] is the intercept, beta[1] is the slope
beta = np.array([1, 2])

# Create a grid of x values
x = np.linspace(-10, 10, 100)

# Create a DataFrame with the x values and their corresponding eta values
df = pd.DataFrame({
    'x': x,
    'eta': beta[0] + beta[1] * x  # eta = beta_0 + beta_1 * x, matching the notation above
})

# Create the Altair chart
chart = alt.Chart(df).mark_line().encode(
    x='x',
    y='eta'
)

chart


## Example for chat messages

Sure, let's consider the following hypothetical scenario:

We are building a logistic regression model to predict whether a chat message is spam (1) or not spam (0) based on the number of links in the message. We have two parameters, the intercept `β0` and the coefficient for the number of links `β1`.

We can represent the key components of the logistic regression model with this specific example as follows:

**1. The Linear Predictor:**

This is the linear combination of the predictor (number of links). Suppose our intercept `β0` is -3 and our coefficient `β1` is 2.5.

$$
\eta = \beta_0 + \beta_1 \cdot \text{number of links} = -3 + 2.5 \cdot \text{number of links}
$$

**2. The Logistic Function:**

This maps the linear predictor onto the interval (0, 1) as a probability.

$$
\mu = \frac{1}{1 + e^{-\eta}} = \frac{1}{1 + e^{-(-3 + 2.5 \cdot \text{number of links})}}
$$

**3. The Logistic Regression Model:**

The logistic regression model combines these two components, predicting the probability that a message is spam based on the number of links it contains.

$$
\mu = \frac{1}{1 + e^{-(-3 + 2.5 \cdot \text{number of links})}}
$$
So, in this case, `μ` represents the predicted probability that a message is spam, given the number of links in the message.
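
To make the numbers concrete, the sketch below evaluates the linear predictor and the logistic function for messages with 0 to 4 links, using the example values `β0 = -3` and `β1 = 2.5`. The function names are illustrative, and the coefficients are the hypothetical ones from this example, not estimates from data.

```python
import math

beta_0, beta_1 = -3.0, 2.5  # intercept and coefficient from the example

def linear_predictor(n_links):
    """eta = beta_0 + beta_1 * number of links"""
    return beta_0 + beta_1 * n_links

def logistic(eta):
    """mu = 1 / (1 + exp(-eta)), a value strictly between 0 and 1"""
    return 1.0 / (1.0 + math.exp(-eta))

for n_links in range(5):
    eta = linear_predictor(n_links)
    mu = logistic(eta)
    print(f"links={n_links}  eta={eta:+.1f}  mu={mu:.4f}")
```

With these coefficients a message without links gets a low spam probability, and the probability rises quickly as links are added, because each extra link adds 2.5 to the log-odds.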

## Example of being ultra-prescriptive to connect the historical symbols/words/concepts in McCullagh and Nelder to the modern symbols/words/concepts in the lecture note

> Sorry this isn't quite the same formalism. We need to be ultra-prescriptive here so please try again using the same mathematical notation, symbols, and style of this lecture note from previous class that you generated for us: 

Apologies for the confusion earlier. Let's use a more prescriptive approach by working through a specific example, using the same notation and symbols from the lecture note.

Suppose we want to build a model to predict if a chat message is spam (1) or not spam (0) based on the number of links (`x1`) and the length of the message in characters (`x2`). Our design matrix `X` then has one row per message, with a leading column of ones for the intercept followed by these two features, and our parameter vector `β` contains three entries that we wish to estimate: the intercept `β0` and the coefficients `β1` and `β2`.

The systematic component (linear predictor `η`) can be written as:

$$
\eta = X\beta = \beta_0 + \beta_1x_1 + \beta_2x_2
$$
where:

- `x1` is the number of links in the chat message
- `x2` is the length of the message
- `β0` is the intercept
- `β1` is the coefficient for the number of links (`x1`)
- `β2` is the coefficient for the message length (`x2`)

The logistic regression model uses the logit link function, which connects the systematic and random components:

$$
g(\mu) = \log\left(\frac{\mu}{1 - \mu}\right) = \eta = \beta_0 + \beta_1x_1 + \beta_2x_2
$$
where `μ` represents the expected value (probability of being spam) given the predictors (number of links and message length).

Therefore, the logistic regression model can be written as:

$$
\log\left(\frac{\mu}{1 - \mu}\right) = \beta_0 + \beta_1x_1 + \beta_2x_2
$$
where:

- `μ` is the expected value (probability of being spam), given the predictors
- `x1` and `x2` are our predictors (number of links and message length, respectively)
- `β0`, `β1`, and `β2` are the parameters to be estimated

Finally, solving this equation for `μ` gives the standard form of the logistic regression model:

$$
\mu = \frac{1}{1 + \exp(-(\beta_0 + \beta_1x_1 + \beta_2x_2))}
$$
This equation will give us the probability that a chat message is spam, given its number of links and its length.
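
The matrix form $\eta = X\beta$ and the expanded form $\beta_0 + \beta_1x_1 + \beta_2x_2$ agree once `X` carries a column of ones for the intercept. A minimal NumPy sketch, in which the three messages and the coefficient values are made up purely for illustration:

```python
import numpy as np

# Hypothetical messages: (number of links x1, length in characters x2)
features = np.array([[0, 120],
                     [1, 45],
                     [3, 30]], dtype=float)

# Prepend a column of ones so that the intercept beta_0 is part of X @ beta.
X = np.column_stack([np.ones(len(features)), features])

beta = np.array([-3.0, 2.5, -0.01])  # beta_0, beta_1, beta_2: illustrative, not fitted

eta = X @ beta                       # linear predictor, one value per message
mu = 1.0 / (1.0 + np.exp(-eta))      # inverse of the logit link

print("eta:", eta)
print("mu: ", mu)
```

Each row of `X` produces one value of `η` and one probability `μ`, so the same code works for any number of messages.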

The steps to solve for $\mu$ are worked through in the aside below.

## Aside: solving the logit link function for μ to get the standard form of logistic regression

First, we start with the logit link function:

\begin{equation}
\log\left(\frac{\mu}{1 - \mu}\right) = \beta_0 + \beta_1x_1 + \beta_2x_2
\end{equation}

We then want to solve this equation for $\mu$.

Step 1: Take the exponential of both sides to remove the natural logarithm on the left side:

\begin{equation}
\frac{\mu}{1 - \mu} = \exp(\beta_0 + \beta_1x_1 + \beta_2x_2)
\end{equation}

Step 2: Multiply both sides by $1 - \mu$ to clear the denominator:

\begin{equation}
\mu = (1 - \mu) \exp(\beta_0 + \beta_1x_1 + \beta_2x_2)
\end{equation}

Step 3: Distribute the right-hand side:

\begin{equation}
\mu = \exp(\beta_0 + \beta_1x_1 + \beta_2x_2) - \mu \exp(\beta_0 + \beta_1x_1 + \beta_2x_2)
\end{equation}

Step 4: Add $\mu \exp(\beta_0 + \beta_1x_1 + \beta_2x_2)$ to both sides and factor out $\mu$:

\begin{equation}
\mu (1 + \exp(\beta_0 + \beta_1x_1 + \beta_2x_2)) = \exp(\beta_0 + \beta_1x_1 + \beta_2x_2)
\end{equation}

Step 5: Finally, solve for $\mu$:

\begin{equation}
\mu = \frac{\exp(\beta_0 + \beta_1x_1 + \beta_2x_2)}{1 + \exp(\beta_0 + \beta_1x_1 + \beta_2x_2)}
\end{equation}

Dividing the numerator and the denominator by $\exp(\beta_0 + \beta_1x_1 + \beta_2x_2)$ gives the standard form of the logistic regression model:

\begin{equation}
\mu = \frac{1}{1 + \exp(-(\beta_0 + \beta_1x_1 + \beta_2x_2))}
\end{equation}
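
The algebra can be sanity-checked numerically: for any value of the linear predictor, the Step 5 form and the standard form should give the same $\mu$, and applying the logit to that $\mu$ should recover the linear predictor. A small sketch with the standard library, where `eta` stands for $\beta_0 + \beta_1x_1 + \beta_2x_2$ and the test values are arbitrary:

```python
import math

def logit(mu):
    """Link function: log-odds of mu."""
    return math.log(mu / (1 - mu))

def form_step5(eta):
    """exp(eta) / (1 + exp(eta)), the result of Step 5."""
    return math.exp(eta) / (1 + math.exp(eta))

def form_standard(eta):
    """1 / (1 + exp(-eta)), the standard logistic form."""
    return 1 / (1 + math.exp(-eta))

for eta in (-4.0, -0.5, 0.0, 1.5, 6.0):
    mu = form_standard(eta)
    assert math.isclose(mu, form_step5(eta))            # both forms agree
    assert math.isclose(logit(mu), eta, abs_tol=1e-9)   # logit undoes the logistic
    print(f"eta={eta:+.1f}  mu={mu:.6f}")
```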



## Example on chat logs for the class

First, figure out which columns and types are in each parquet file:

In [22]:
!duckdb :memory: "DESCRIBE SELECT * FROM './data/datathinking.zulipchat.com/processed/zerver_message.parquet'"

┌────────────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│      column_name       │ column_type │  null   │   key   │ default │  extra  │
│        varchar         │   varchar   │ varchar │ varchar │ varchar │ varchar │
├────────────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ id                     │ UBIGINT     │ YES     │         │         │         │
│ sender                 │ UBIGINT     │ YES     │         │         │         │
│ recipient              │ BIGINT      │ YES     │         │         │         │
│ realm                  │ UBIGINT     │ YES     │         │         │         │
│ subject                │ VARCHAR     │ YES     │         │         │         │
│ content                │ VARCHAR     │ YES     │         │         │         │
│ rendered_content       │ VARCHAR     │ YES     │         │         │         │
│ rendered_content_ver…  │ UBIGINT     │ YES     │         │         │         │
│ date_sent              │ V

In [23]:
!duckdb :memory: "DESCRIBE SELECT * FROM './data/datathinking.zulipchat.com/processed/zerver_usermessage.parquet'"

┌──────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name  │ column_type │  null   │   key   │ default │  extra  │
│   varchar    │   varchar   │ varchar │ varchar │ varchar │ varchar │
├──────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ id           │ UBIGINT     │ YES     │         │         │         │
│ user_profile │ UBIGINT     │ YES     │         │         │         │
│ message      │ UBIGINT     │ YES     │         │         │         │
│ flags_mask   │ BOOLEAN     │ YES     │         │         │         │
└──────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘


## Example on chat logs for the class: fitting logistic regression with JAX

> please generate code using jax to fit a logistic regression model using the above example and the same notation and variable names using the following schema for this parquet file that contains the data:

!duckdb :memory: "DESCRIBE SELECT * FROM './data/datathinking.zulipchat.com/processed/zerver_usermessage.parquet'"

```
┌──────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name  │ column_type │  null   │   key   │ default │  extra  │
│   varchar    │   varchar   │ varchar │ varchar │ varchar │ varchar │
├──────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ id           │ UBIGINT     │ YES     │         │         │         │
│ user_profile │ UBIGINT     │ YES     │         │         │         │
│ message      │ UBIGINT     │ YES     │         │         │         │
│ flags_mask   │ BOOLEAN     │ YES     │         │         │         │
└──────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
```

Based on your question, it seems like you want to use the fields user_profile and message as the predictors (features) and flags_mask as the binary target variable. For simplicity, let's assume flags_mask is our binary target with 1 indicating spam and 0 indicating not spam.

Here is the Python code using JAX to fit a logistic regression model. Please note that the JAX library primarily works with arrays, not data frames, so the input data will need to be converted into this format.

Please adjust `learning_rate` and `epochs` as needed. In addition, please install necessary dependencies (if not installed) before running this code. The example below assumes a very simple logistic regression model and does not include any feature scaling or selection, regularization, cross-validation, or other techniques that may be appropriate for a real-world application.

Please replace the path to the parquet file with your actual path and ensure that your environment has enough memory to hold the data.

In [24]:
import jax
import jax.numpy as np
from jax import random
from jax.scipy.special import expit as sigmoid
from jax import grad, jit, vmap
from jax import lax

import pandas as pd
import pyarrow.parquet as pq

# Read parquet file
df = pd.read_parquet('./data/datathinking.zulipchat.com/processed/zerver_usermessage.parquet')

# Ensure correct data types
df = df.astype({"user_profile": 'float32', "message": 'float32', "flags_mask": 'int32'})

# Define features and target
X = df[['user_profile', 'message']].values
y = df['flags_mask'].values

# Initialize parameters randomly
key = random.PRNGKey(0)
key, W_key, b_key = random.split(key, 3)
W = random.normal(W_key, (2,))
b = random.normal(b_key, ())

# Sigmoid function
def sigmoid(x):
    return 0.5 * (np.tanh(x / 2) + 1)

# Predict function
def predict(W, b, inputs):
    return sigmoid(np.dot(inputs, W) + b)

# Loss function
def loss(W, b, inputs, targets):
    preds = predict(W, b, inputs)
    label_probs = preds * targets + (1 - preds) * (1 - targets)
    return -np.sum(np.log(label_probs))

# Compute gradient of loss function
grad_loss = jit(grad(loss, argnums=(0, 1)))

# Training loop
epochs = 10
learning_rate = 0.001
for i in range(epochs):
    dW, db = grad_loss(W, b, X, y)
    W -= learning_rate * dW
    b -= learning_rate * db

# Print final parameters
print("W:", W)
print("b:", b)


W: [nan nan]
b: nan


In [25]:
df

Unnamed: 0,id,user_profile,message,flags_mask
0,8257034355,544719.0,299488064.0,1
1,8257034356,544719.0,299488064.0,1
2,8257034357,544719.0,299488064.0,1
3,9332100997,587794.0,327255232.0,1
4,9332100998,544719.0,327255232.0,1
...,...,...,...,...
20325,9829132736,589802.0,342317088.0,0
20326,9829132737,590314.0,342317088.0,0
20327,9829132738,591344.0,342317088.0,0
20328,9829132739,590323.0,342317088.0,0
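
The `nan` parameters printed above are most likely a numerical problem. `user_profile` and `message` are ID columns with magnitudes of roughly 10^5 to 10^8, so the linear predictor is enormous, the sigmoid saturates to exactly 0 or 1 in 32-bit floats, and the loss takes the logarithm of 0. The sketch below reproduces this in plain NumPy rather than JAX, using two rows copied from the table above and hypothetical coefficients. It then shows two common remedies, assumed here rather than taken from the lecture: standardizing the features and writing the loss as $\log(1 + e^{\eta}) - y\eta$.

```python
import numpy as np

# Two rows copied from the DataFrame above; the coefficients are hypothetical.
X = np.array([[544719.0, 299488064.0],
              [589802.0, 342317088.0]], dtype=np.float32)
y = np.array([1, 0], dtype=np.float32)
beta = np.array([0.5, -0.3], dtype=np.float32)
beta_0 = np.float32(0.1)

def naive_loss(eta, y):
    """Loss as written in the cell above: log of the probability of each label."""
    mu = 1.0 / (1.0 + np.exp(-eta))
    label_probs = mu * y + (1 - mu) * (1 - y)
    return -np.sum(np.log(label_probs))

def stable_loss(eta, y):
    """Same negative log-likelihood, rewritten as log(1 + e^eta) - y * eta."""
    return np.sum(np.logaddexp(0.0, eta) - y * eta)

with np.errstate(over="ignore", divide="ignore"):
    eta_raw = X @ beta + beta_0
    print("naive loss on raw IDs:", naive_loss(eta_raw, y))  # not finite

# Standardizing each column keeps eta in a range where the sigmoid does not saturate.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
eta_std = X_std @ beta + beta_0
print("stable loss on standardized features:", stable_loss(eta_std, y))
```

Even with these fixes, ID columns carry no meaningful information about a message, so rescaling addresses the arithmetic but not the choice of predictors.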


## This was incorrect because I didn't notice that `message` is a numeric ID column (`UBIGINT`) instead of the message text we need

I had to remind GPT-4 many times to follow the example, using prompts like `awesome. can you please rewrite to be as close to this example as possible and use the same notation as the latex you helped us generate for the math?`, to get the following.

In [29]:
import jax
import jax.numpy as np
from jax import random
from jax.scipy.special import expit as sigmoid
from jax import grad, jit, vmap
from jax import lax

import pandas as pd
import pyarrow.parquet as pq

# Read parquet file
df = pd.read_parquet('./data/datathinking.zulipchat.com/processed/zerver_message.parquet')

# Initialize parameters randomly
key = random.PRNGKey(0)
key, W_key, b_key = random.split(key, 3)
beta = random.normal(W_key, (2,))  # Parameter vector β
beta_0 = random.normal(b_key, ())  # Intercept β0

# Assume number of links and length of message are our predictors
# For simplicity, let's generate these columns randomly as the actual computation might be more complex
df['x1'] = random.randint(key=key, minval=0, maxval=5, shape=df.shape[0])  # Number of links
df['x2'] = df['content'].str.len()  # Length of message

# Ensure correct data types
df = df.astype({"x1": 'float32', "x2": 'float32', "has_link": 'int32'})

# Define features and target
X = df[['x1', 'x2']].values
y = df['has_link'].values

# Predict function
def predict(beta, beta_0, inputs):
    return sigmoid(np.dot(inputs, beta) + beta_0)

# Loss function
def loss(beta, beta_0, inputs, targets):
    preds = predict(beta, beta_0, inputs)
    label_probs = preds * targets + (1 - preds) * (1 - targets)
    return -np.sum(np.log(label_probs))

# Compute gradient of loss function
grad_loss = jit(grad(loss, argnums=(0, 1)))

# Training loop
epochs = 10
learning_rate = 0.001
for i in range(epochs):
    dbeta, dbeta_0 = grad_loss(beta, beta_0, X, y)
    beta -= learning_rate * dbeta
    beta_0 -= learning_rate * dbeta_0

# Print final parameters
print("beta:", beta)
print("beta_0:", beta_0)


TypeError: 'int' object is not iterable
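
The `TypeError` appears to come from `shape=df.shape[0]`: `jax.random.randint` expects a shape tuple such as `(df.shape[0],)`, not a bare integer. The cell also relies on a `has_link` column that does not appear in the (truncated) schema listing above. The sketch below sidesteps both issues under clearly stated assumptions: it uses plain NumPy instead of JAX, a handful of made-up messages in place of the `content` column, and hand-made spam labels, since the schema shows no spam label. It counts links with a simple regular expression, standardizes the features, and fits `β0`, `β1`, `β2` by gradient descent.

```python
import re
import numpy as np

# Hypothetical stand-ins for the `content` column and for spam labels.
messages = [
    "see you at office hours tomorrow",
    "win a prize http://a.example http://b.example http://c.example",
    "notes are posted here: https://course.example/notes, thanks everyone for coming",
    "click http://x.example now http://y.example",
    "can someone explain the logit link function again? I got lost at step 3",
    "free http://z.example http://w.example http://v.example http://u.example",
]
y = np.array([0, 1, 0, 1, 0, 1], dtype=float)

x1 = np.array([len(re.findall(r"https?://", m)) for m in messages], dtype=float)  # number of links
x2 = np.array([len(m) for m in messages], dtype=float)                            # message length

features = np.column_stack([x1, x2])
features = (features - features.mean(axis=0)) / features.std(axis=0)  # standardize each column
X = np.column_stack([np.ones(len(messages)), features])               # column of ones for beta_0

def loss(beta):
    """Mean negative log-likelihood in the numerically stable form."""
    eta = X @ beta
    return np.mean(np.logaddexp(0.0, eta) - y * eta)

def grad(beta):
    """Gradient of the loss: X^T (mu - y) / n."""
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (mu - y) / len(y)

beta = np.zeros(3)  # beta_0, beta_1, beta_2
initial_loss = loss(beta)
for _ in range(500):
    beta -= 0.5 * grad(beta)  # learning rate 0.5 is an arbitrary choice

print("beta:", beta)
print("loss:", initial_loss, "->", loss(beta))
```

Because the toy labels can be separated perfectly by the link count, the coefficients keep growing with more iterations; real data would call for regularization or a held-out evaluation.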