In [1]:
# @title Question 1
import numpy as np
from sklearn.linear_model import LogisticRegression

# Generating synthetic data
np.random.seed(0)
X_data = np.random.randn(100, 6)  # 100 samples, 6 features
y_data = np.random.randint(0, 2, 100)  # Binary target 0/1

# Training an initial logistic regression model on the original data
original_model = LogisticRegression()
original_model.fit(X_data, y_data)
initial_weights = original_model.coef_[0]

# Duplicating the last feature in the dataset
X_modified = np.column_stack((X_data, X_data[:, -1]))

# Training a new logistic regression model on the modified data
modified_model = LogisticRegression()
modified_model.fit(X_modified, y_data)
modified_weights = modified_model.coef_[0]

print("Initial Weights:", initial_weights)
print("Modified Weights:", modified_weights)


Initial Weights: [ 0.30115786  0.12276031 -0.11469102 -0.11556772 -0.13229105  0.41705124]
Modified Weights: [ 0.30189883  0.12218828 -0.11557031 -0.11498384 -0.13195805  0.2143759
  0.2143759 ]


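Trained a logistic regression model on the original data, then again after duplicating the last feature.

Observations:

The weights w_0 to w_n-1 are approximately equal in the original and the new model.
The weight of the duplicated feature is split evenly between the two copies, so approximately w_n = w_n_new + w_n+1_new (0.2144 + 0.2144 = 0.4288 vs. 0.4171). The sum is slightly larger than the original weight because scikit-learn's default L2 penalty is smaller when a weight is spread over two identical columns.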
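
The two observations can be checked numerically against the printed weights. This is a minimal sketch using the values copied from the cell output; the names `shared_drift` and `combined` are illustrative and not part of the original cell.

```python
# Weights copied from the printed output above.
initial_weights = [0.30115786, 0.12276031, -0.11469102,
                   -0.11556772, -0.13229105, 0.41705124]
modified_weights = [0.30189883, 0.12218828, -0.11557031,
                    -0.11498384, -0.13195805, 0.2143759, 0.2143759]

# Largest change among the weights of the five untouched features.
shared_drift = max(
    abs(a - b) for a, b in zip(initial_weights[:-1], modified_weights[:-2])
)

# Sum of the two weights assigned to the duplicated feature.
combined = modified_weights[-2] + modified_weights[-1]

print(f"max drift of shared weights: {shared_drift:.5f}")
print(f"w_n original: {initial_weights[-1]:.5f}, sum of duplicates: {combined:.5f}")
```

The shared weights move by less than 0.001, while the two duplicated weights sum to a value within about 0.012 of the original w_n.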

In [3]:
# @title Question 2
from scipy.stats import chi2_contingency

# Click-through data: CTRs are given in %; with 1000 sends per template they become [clicks, no-clicks] counts
clicks_A = 100
no_clicks_A = 900

clicks_B = 70
no_clicks_B = 930

clicks_C = 85
no_clicks_C = 915

clicks_D = 120
no_clicks_D = 880

clicks_E = 140
no_clicks_E = 860

# Function to perform chi-squared test and return p-value
def calculate_p_value(control_clicks, control_no_clicks, other_clicks, other_no_clicks):
    data = [[control_clicks, control_no_clicks], [other_clicks, other_no_clicks]]
    _, p_value, _, _ = chi2_contingency(data)
    return p_value

# Calculate p-values for each template compared to A
p_values_dict = {
    "B vs A": calculate_p_value(clicks_A, no_clicks_A, clicks_B, no_clicks_B),
    "C vs A": calculate_p_value(clicks_A, no_clicks_A, clicks_C, no_clicks_C),
    "D vs A": calculate_p_value(clicks_A, no_clicks_A, clicks_D, no_clicks_D),
    "E vs A": calculate_p_value(clicks_A, no_clicks_A, clicks_E, no_clicks_E)
}

p_values_dict


{'B vs A': 0.020060502655718262,
 'C vs A': 0.2799261382501793,
 'D vs A': 0.17451579008209805,
 'E vs A': 0.007283436889671482}

Template A gets 10% click through rate (CTR), B gets 7% CTR, C gets 8.5% CTR, D gets 12% CTR and E gets 14% CTR.

Using: Chi-squared test for proportions

Null hypothesis H0 for each comparison is that there is no difference between the CTRs of the control template (A) and the other templates (B, C, D, E).

Alternative hypothesis Ha is that there is a difference.

We reject the null hypothesis at the 95% confidence level, i.e. when the p-value is below the significance level of 0.05.

p-value for B vs A: 0.02 < 0.05. Reject H0 => there is a difference (B is worse than A)

p-value for C vs A: 0.28 > 0.05. Fail to reject H0 => not enough evidence of a difference between A & C

p-value for D vs A: 0.17 > 0.05. Fail to reject H0 => not enough evidence of a difference between A & D

p-value for E vs A: 0.007 < 0.05. Reject H0 => there is a difference (E is better than A)

b. E is better than A with over 95% confidence, and B is worse than A with over 95% confidence. The test needs to run for longer to tell how C and D compare to A with 95% confidence.
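
For a 2x2 table, SciPy's `chi2_contingency` applies Yates' continuity correction by default, so the p-values above are the corrected ones. The sketch below recomputes them with the standard library only, to make the arithmetic behind each comparison visible. The function name `yates_chi2_p_value` is hypothetical and introduced here for illustration.

```python
import math

def yates_chi2_p_value(clicks_1, no_clicks_1, clicks_2, no_clicks_2):
    """p-value of a 2x2 chi-squared test with Yates' continuity correction."""
    observed = [[clicks_1, no_clicks_1], [clicks_2, no_clicks_2]]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            # Yates: shrink each |observed - expected| by 0.5 (never past zero).
            deviation = max(abs(observed[i][j] - expected) - 0.5, 0.0)
            chi2 += deviation ** 2 / expected

    # Survival function of a chi-squared variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))

# Compare every template against control A (100 clicks out of 1000).
for name, clicks in {"B": 70, "C": 85, "D": 120, "E": 140}.items():
    p = yates_chi2_p_value(100, 900, clicks, 1000 - clicks)
    print(f"{name} vs A: p = {p:.4f}")
```

The last line of the function relies on the fact that a chi-squared variable with one degree of freedom is the square of a standard normal, so its tail probability can be written with `math.erfc`.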

In [None]:
# @title Question 3


Approximate computational cost of each gradient descent iteration of logistic regression:

The gradient of the cost function with respect to the weights is: ∇J(θ) = (1/m) Xᵀ(h(Xθ) − y), where h is the sigmoid function applied element-wise.

Computational Cost:

Matrix Multiplication (Xθ): Each row of X has, on average, k non-zero entries, and there are m such rows, so about mk non-zero entries take part in the multiplication. The cost of a sparse matrix-vector multiplication is proportional to the number of non-zero elements, so this step costs O(mk).

Computing the Hypothesis (h(Xθ)): This step applies the sigmoid function to each of the m results of the matrix-vector multiplication. The cost is O(m).

Gradient Calculation (Xᵀ(h(Xθ) − y)): The key part of this computation is again a matrix-vector multiplication, this time with the sparse matrix Xᵀ. The number of non-zero elements is the same as in the first step, so the cost is also O(mk).

Overall Computational Cost: O(mk), since the O(m) sigmoid step is dominated by the two sparse multiplications.
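
The cost argument can be illustrated with a small pure-Python sketch that stores each row as a list of (column, value) pairs and counts how many non-zero entries it touches. The function name `sparse_logistic_gradient`, the row format, and the toy data are assumptions made for illustration. Allocating the dense gradient vector costs an extra O(n) for n features, which the analysis above does not count.

```python
import math

def sparse_logistic_gradient(rows, y, theta):
    """Gradient (1/m) X^T (h(X theta) - y) for a row-sparse X.

    rows[i] is a list of (column_index, value) pairs for the non-zeros of row i.
    Returns the gradient and the number of non-zero entries visited.
    """
    m = len(rows)
    grad = [0.0] * len(theta)
    nonzeros_touched = 0
    for row, label in zip(rows, y):
        # Forward pass for this row: only its non-zeros contribute to X theta.
        z = sum(value * theta[j] for j, value in row)
        residual = 1.0 / (1.0 + math.exp(-z)) - label
        # Backward pass: the same non-zeros receive the residual.
        for j, value in row:
            grad[j] += residual * value / m
        nonzeros_touched += 2 * len(row)
    return grad, nonzeros_touched

# Toy data: m = 3 rows, 4 features, 5 non-zeros in total.
rows = [[(0, 1.0), (2, 2.0)], [(1, 3.0)], [(0, -1.0), (3, 0.5)]]
y = [1, 0, 1]
theta = [0.0, 0.0, 0.0, 0.0]

grad, nonzeros_touched = sparse_logistic_gradient(rows, y, theta)
print(grad, nonzeros_touched)
```

Each non-zero entry is visited twice, once for Xθ and once for the multiplication by Xᵀ, which matches the two O(mk) terms in the analysis.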

In [None]:
# @title Question 4


Expected Ranking in Terms of Accuracy:

Method 2 (Random Labeled Stories): By training on a diverse and representative sample, V2 is likely to develop a more general understanding of the categories across different news sources. This method is expected to yield the highest accuracy in classifying a broad range of articles.

Method 3 (Wrong and Farthest from the Decision Boundary): Learning from the most confident mistakes of V1 could address specific weaknesses, but the resulting training set lacks generality and can lead to overfitting, which makes this approach less effective.

Method 1 (Closest to the Decision Boundary): Although valuable for fine-tuning the decision boundary, the focus on edge cases might not contribute as much to overall accuracy as the other methods.

In [None]:
# @title Question 5


Maximum Likelihood Estimate (MLE):

The MLE for a binomial distribution (which is appropriate for coin tosses) is simply the ratio of the number of successes (heads) to the total number of trials. Therefore, the MLE of p is k/n

Bayesian Estimate with Uniform Prior: For the Bayesian estimate, we start with a uniform prior for p over [0, 1]. This is equivalent to a Beta distribution with parameters α = 1, β = 1

Now for k heads in n tosses, the posterior distribution for p is a Beta distribution with α = k+1, β = n-k+1

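The Bayesian estimate of p, taken as the posterior mean, is (k+1)/(n+2).

Maximum a Posteriori (MAP) Estimate with Uniform Prior: k/n (assuming k>0 and n-k>0). This is the mode of the Beta(k+1, n-k+1) posterior, and it coincides with the MLE because the prior is flat.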
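
The three estimates can be compared side by side with exact fractions. This is a minimal sketch; the function name `coin_estimates` and the example of k = 7 heads in n = 10 tosses are illustrative and not taken from the question.

```python
from fractions import Fraction

def coin_estimates(k, n):
    """Return (MLE, posterior mean, MAP) for k heads in n tosses, uniform prior."""
    mle = Fraction(k, n)
    # Posterior is Beta(alpha, beta) with alpha = k + 1, beta = n - k + 1.
    alpha, beta = k + 1, n - k + 1
    posterior_mean = Fraction(alpha, alpha + beta)          # (k+1)/(n+2)
    map_estimate = Fraction(alpha - 1, alpha + beta - 2)    # mode of the Beta
    return mle, posterior_mean, map_estimate

mle, posterior_mean, map_estimate = coin_estimates(k=7, n=10)
print(mle, posterior_mean, map_estimate)  # prints: 7/10 2/3 7/10
```

The posterior mean is pulled from 7/10 towards 1/2, while the MAP estimate stays at the MLE.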