# **Q/A Assignment**

1) You train logistic regression with a certain set of features and learn weights $w_0$, $w_1$ through $w_n$.
Feature $n$ gets weight $w_n$ at the end of training. Say you now create a new dataset where you duplicate feature $n$ into feature $(n+1)$ and retrain a new model. Suppose the new model's weights are $w_{new_0}$, $w_{new_1}$ through $w_{new_n}$, $w_{new_{n+1}}$. What is the likely relationship between $w_{new_0}$, $w_{new_1}$, $w_{new_n}$, and $w_{new_{n+1}}$?


Answer - When a feature is duplicated, the original and the copy are perfectly correlated, so the model can split the original weight between them and still make exactly the same predictions. The weights of the untouched features, such as $w_{new_0}$ and $w_{new_1}$, are likely to stay close to $w_0$ and $w_1$.

So, if $w_n$ was the weight of the original feature $n$, then after duplication the weights $w_{new_n}$ and $w_{new_{n+1}}$ will likely sum to $w_n$, that is, $w_{new_n} + w_{new_{n+1}} \approx w_n$.

This is not a strict rule, because how the weight is split depends on the specifics of training. Without regularization, any split with the right sum gives the same loss; with L2 regularization the two weights come out equal, each roughly $w_n / 2$.

Duplicating a feature adds no new information to the model, and it introduces perfect multicollinearity, which makes the individual weight estimates less stable.
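The sum relationship can be checked on synthetic data. This is a minimal sketch, assuming unregularized logistic regression fit by batch gradient descent from a zero initialization (the question does not specify an optimizer or a penalty); `fit_logreg`, the learning rate, and the generated data are illustrative choices, not part of the assignment.

```python
import numpy as np

def fit_logreg(X, y, lr=0.5, n_iter=5000):
    """Batch gradient descent on the unregularized logistic loss."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))          # predicted probabilities
        w -= lr * Xb.T @ (p - y) / len(y)          # mean gradient step
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-(X @ true_w)))).astype(float)

w_old = fit_logreg(X, y)                           # [w_0, w_1, w_2, w_3]
X_dup = np.hstack([X, X[:, [-1]]])                 # duplicate the last feature
w_new = fit_logreg(X_dup, y)                       # [w_0, w_1, w_2, w_3, w_4]

print("original last weight:", w_old[-1])
print("duplicated pair:", w_new[-2], w_new[-1], "sum:", w_new[-2] + w_new[-1])
```

Starting from zeros, the two copies receive identical gradients at every step, so they end up equal. A different initialization or solver could produce a different split with the same sum.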

2) You currently have an email marketing template A and you want to replace it with a better template. A is the control template. You also test email templates B, C, D, and E. You send exactly 1000 emails of each template to different random users. You wish to figure out which email gets the highest click-through rate (CTR). Template A gets a 10% CTR, B gets 7%, C gets 8.5%, D gets 12%, and E gets 14%. You want to run your multivariate test until you get 95% confidence in a conclusion. Which of the following is true?
    1. We have too little data to conclude that A is better or worse than any other template with 95% confidence.
    2. E is better than A with over 95% confidence, B is worse than A with over 95% confidence. You need to run the test for longer to tell where C and D compare to A with 95% confidence.
    3. Both D and E are better than A with 95% confidence. Both B and C are worse than A with over 95% confidence.

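Ans - Option 2. E is better than A with over 95% confidence. B is worse than A with over 95% confidence. You need to run the test for longer to tell where C and D compare to A with 95% confidence. With 1000 emails per template, a two-proportion z-test against A gives $|z|$ of about 2.4 for B and 2.8 for E, both above the 1.96 threshold, but only about 1.2 for C and 1.4 for D.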
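The z-scores behind this conclusion can be reproduced with a few lines. This is a sketch under stated assumptions: a pooled two-proportion z-test and a two-sided 1.96 threshold, neither of which is named in the question, and no correction for comparing four templates against the same control. `two_proportion_z` is an illustrative helper.

```python
import math

def two_proportion_z(p_a, p_b, n_a, n_b):
    """z-score for the difference p_b - p_a using the pooled standard error."""
    pooled = (p_a * n_a + p_b * n_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

control_ctr = 0.10
ctr = {"B": 0.07, "C": 0.085, "D": 0.12, "E": 0.14}

z_scores = {name: two_proportion_z(control_ctr, p, 1000, 1000) for name, p in ctr.items()}
significant = {name: abs(z) > 1.96 for name, z in z_scores.items()}

for name, z in z_scores.items():
    print(name, round(z, 2), "significant" if significant[name] else "inconclusive")
# z is roughly -2.41 (B), -1.16 (C), 1.43 (D), 2.75 (E)
```

A stricter analysis that corrects for multiple comparisons would raise the threshold, so the result for B in particular should be read with some caution.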

3) You have $m$ training examples and $n$ features. Your feature vectors are, however, sparse: the average number of non-zero entries in each training example is $k$, with $k \ll n$. What is the approximate computational cost of each gradient descent iteration of logistic regression in modern, well-written packages?

Ans - The cost of each gradient descent iteration is proportional to the number of non-zero entries in the feature vectors, because well-written packages such as scikit-learn store the data in sparse structures and operate only on the non-zero entries.

With $m$ training examples and an average of $k$ non-zero entries per example, each iteration therefore costs approximately $O(mk)$.

This is significantly less than the $O(mn)$ cost of a dense implementation that does not exploit the sparsity of the data.
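To make the $O(mk)$ count concrete, here is a minimal pure-Python sketch of one gradient computation over sparse rows. The row format (a list of `(feature_index, value)` pairs) and the `ops` counter are assumptions chosen for illustration; real packages use compressed sparse matrices, but the amount of work has the same shape.

```python
import math

def sparse_logreg_gradient(rows, y, w):
    """Summed logistic-loss gradient; rows[i] lists (feature_index, value) pairs."""
    grad = {}   # only coordinates that appear in some row are ever stored
    ops = 0     # counts multiply-adds to show the cost scales with non-zeros
    for row, label in zip(rows, y):
        z = sum(w[j] * v for j, v in row)            # sparse dot product
        ops += len(row)
        err = 1.0 / (1.0 + math.exp(-z)) - label     # prediction minus label
        for j, v in row:                             # scatter into the gradient
            grad[j] = grad.get(j, 0.0) + err * v
        ops += len(row)
    return grad, ops

n = 1000                                             # number of features
rows = [[(0, 1.0), (7, 2.0)], [(3, 1.0)], [(7, -1.0), (999, 0.5)]]
y = [1, 0, 1]
w = [0.0] * n

grad, ops = sparse_logreg_gradient(rows, y, w)
print(sorted(grad.items()), ops)
```

The work is two passes over the 5 stored non-zeros, independent of `n`. Updating only the touched coordinates of `w` keeps the whole step at $O(mk)$ when there is no regularization term.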



4) We are interested in building a high-quality text classifier that categorizes news stories into 2 categories: information and entertainment. We want the classifier to stick with predicting the better of these two categories (it won't try to predict a percent score for them). You have already trained V1 of a classifier with 10,000 news stories from the New York Times, which is one of the 1000 news sources whose stories we would like the next version of our classifier (let's call it V2) to categorize correctly. You would like to train a new classifier with the original 10,000 New York Times news stories and an additional 10,000 different news stories and no more. Below are approaches to generating the additional 10,000 pieces of training data for V2.

    1. Run our V1 classifier on 1 Million random stories from the 1000 news sources. Get the 10k stories where the V1 classifier’s output is closest to the decision boundary and get these examples labeled.
    2. Get 10k random labeled stories from the 1000 news sources we care about.
    3. Pick a random sample of 1 million stories from 1000 news sources and have them labeled. Pick the subset of 10k stories where the V1 classifier’s output is both wrong and farthest away from the decision boundary.
    
  Ignore the difference in costs and effort in obtaining training data using the different methods described above. In terms of pure accuracy of classifier V2 when classifying a bag of new articles from the 1000 news sources, what is likely to be the value of these different methods? How do you think the models will rank based on their accuracy?

Ans-

In terms of pure accuracy, the ranking will be: 2 > 3 > 1.

**Running V1 classifier on 1 Million random stories and picking 10k stories closest to the decision boundary**: This approach is likely to improve the classifier's performance on ambiguous cases that lie near the decision boundary. However, it might not significantly improve the overall accuracy as it doesn't necessarily cover a wide range of examples.

**Getting 10k random labeled stories from the 1000 news sources**: This approach provides a more diverse set of training examples, which can help the classifier generalize better to unseen data. It's likely to improve the overall accuracy more than the first approach.

**Picking a random sample of 1 million stories and selecting 10k stories where V1 classifier’s output is wrong and farthest from the decision boundary**: This approach focuses on the most challenging examples for the current classifier. It's likely to improve the classifier's performance on these difficult cases, but it might not generalize well to typical cases.
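The three selection rules can be written down compactly, which makes the difference between them easier to see. This is only a sketch: `select_training_pool`, the strategy names, and the toy scores are hypothetical, and `scores` is assumed to be V1's predicted probability for one of the two classes with 0.5 as the decision boundary.

```python
import random

def select_training_pool(scores, labels, budget, strategy, seed=0):
    """Return indices of the stories to add; labels are only used by approach 3."""
    idx = list(range(len(scores)))
    if strategy == "uncertain":            # approach 1: closest to the boundary
        idx.sort(key=lambda i: abs(scores[i] - 0.5))
        return idx[:budget]
    if strategy == "random":               # approach 2: uniform random sample
        return random.Random(seed).sample(idx, budget)
    if strategy == "confidently_wrong":    # approach 3: wrong and far from the boundary
        wrong = [i for i in idx if (scores[i] >= 0.5) != bool(labels[i])]
        wrong.sort(key=lambda i: abs(scores[i] - 0.5), reverse=True)
        return wrong[:budget]
    raise ValueError(f"unknown strategy: {strategy}")

scores = [0.51, 0.95, 0.10, 0.45, 0.99, 0.30]   # toy V1 outputs
labels = [1, 0, 1, 0, 1, 0]                     # toy true labels

uncertain = select_training_pool(scores, labels, 2, "uncertain")
wrong_far = select_training_pool(scores, labels, 2, "confidently_wrong")
sampled = select_training_pool(scores, labels, 2, "random")
print(uncertain, wrong_far, sampled)
```

Only the random strategy draws from the same distribution the classifier will see at test time; the other two deliberately skew the pool toward a narrow slice of stories.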

5) You wish to estimate the probability $p$ that a coin will come up heads, since it may not be a fair coin. You toss the coin $n$ times and it comes up heads $k$ times. You use the following three methods to estimate $p$:
      
      (a) Maximum likelihood estimate (MLE).
      (b) Bayesian estimate: Here you assume a continuous uniform prior on $p$ over $[0,1]$ (i.e. the probability density function for the value of $p$ is uniformly $1$ inside this range and $0$ outside). Our estimate for $p$ will be the expected value of the posterior distribution of $p$. The posterior distribution is conditioned on these observations.
      (c) Maximum a posteriori (MAP) estimate: Here you assume the same prior as in (b), but we are interested in the value of $p$ that corresponds to the mode of the posterior distribution.
    
  What are the estimates?

Ans -
(a) Each toss follows a Bernoulli distribution with parameter $p$, so the likelihood of observing $k$ heads in $n$ tosses is proportional to $p^k(1-p)^{n-k}$. Taking the log of the likelihood, differentiating with respect to $p$, and setting the derivative to zero gives $p = k/n$. The second derivative is negative, which confirms that this is a maximum. Hence, the MLE is $p = k/n$.
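Written out, the steps of the MLE derivation are as follows. The binomial coefficient is absorbed into the constant because it does not depend on $p$.

$$
\begin{aligned}
L(p) &\propto p^{k}(1-p)^{n-k} \\
\ell(p) = \log L(p) &= k \log p + (n-k)\log(1-p) + \text{const} \\
\frac{d\ell}{dp} &= \frac{k}{p} - \frac{n-k}{1-p} = 0 \;\Longrightarrow\; \hat{p} = \frac{k}{n} \\
\frac{d^{2}\ell}{dp^{2}} &= -\frac{k}{p^{2}} - \frac{n-k}{(1-p)^{2}} < 0
\end{aligned}
$$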

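(b) With a uniform prior, the posterior density is proportional to the likelihood $p^k(1-p)^{n-k}$, which is a $\mathrm{Beta}(k+1,\ n-k+1)$ distribution. Its expected value is $(k+1)/(n+2)$.

(c) The mode of the same posterior is the value that maximizes $p^k(1-p)^{n-k}$, so the MAP estimate coincides with the MLE: $p = k/n$.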
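As a numerical sanity check on the Bayesian and MAP estimates, the sketch below evaluates the unnormalized posterior $p^k(1-p)^{n-k}$ (uniform prior times likelihood) on a grid and reads off its mean and mode. The function name, the grid size, and the example values $n=10$, $k=7$ are illustrative assumptions.

```python
def posterior_summary(n, k, grid=10_001):
    """Mean and mode of the posterior under a uniform prior, by grid evaluation."""
    ps = [i / (grid - 1) for i in range(grid)]
    dens = [p**k * (1 - p) ** (n - k) for p in ps]    # unnormalized posterior
    total = sum(dens)
    mean = sum(p * d for p, d in zip(ps, dens)) / total
    mode = ps[max(range(grid), key=dens.__getitem__)]
    return mean, mode

n, k = 10, 7
mean, mode = posterior_summary(n, k)

print("MLE:", k / n)
print("posterior mean:", mean, "closed form (k+1)/(n+2):", (k + 1) / (n + 2))
print("posterior mode (MAP):", mode)
```

For small $n$ the posterior mean is pulled toward $1/2$ relative to $k/n$, and the gap shrinks as more tosses are observed.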

# Coding Assignment: Implementation and Optimization of GPT-2 Model

**TASK 1**

In [28]:
import torch
from torch import nn
import math

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GPT2Config:
    """Hyperparameters for the GPT-2 model defined below."""
    vocab_size: int = 50257
    n_positions: int = 1024
    n_ctx: int = 1024
    n_embd: int = 768
    n_layer: int = 12
    n_head: int = 12
    layer_norm_epsilon: float = 1e-5
    initializer_range: float = 0.02
    embd_pdrop: float = 0.1
    resid_pdrop: float = 0.1
    attn_pdrop: float = 0.1
    bos_token_id: int = 50256
    eos_token_id: int = 50256
    model_type: str = "gpt2"
    activation_function: str = "gelu_new"
    # Mutable defaults use default_factory so instances never share one object.
    architectures: list = field(default_factory=lambda: ["GPT2LMHeadModel"])
    n_inner: Optional[int] = None
    reorder_and_upcast_attn: bool = False
    scale_attn_by_inverse_layer_idx: bool = False
    scale_attn_weights: bool = True
    summary_activation: Optional[str] = None
    summary_first_dropout: float = 0.1
    summary_proj_to_labels: bool = True
    summary_type: str = "cls_index"
    summary_use_proj: bool = True
    task_specific_params: dict = field(
        default_factory=lambda: {"text-generation": {"do_sample": True, "max_length": 50}}
    )
    transformers_version: str = "4.35.2"
    use_cache: bool = True

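def scaled_dot_product_attention(q, k, v, attention_mask=None, dropout=None):
    # q, k, v: (batch, n_head, seq_len, head_dim)
    matmul_qk = torch.matmul(q, k.transpose(-1, -2))
    dk = k.size(-1)
    scaled_attention_logits = matmul_qk / math.sqrt(dk)
    neg_inf = torch.finfo(scaled_attention_logits.dtype).min

    # GPT-2 is autoregressive: position i may only attend to positions <= i.
    q_len, k_len = q.size(-2), k.size(-2)
    causal = torch.tril(torch.ones(q_len, k_len, dtype=torch.bool, device=q.device))
    scaled_attention_logits = scaled_attention_logits.masked_fill(~causal, neg_inf)

    if attention_mask is not None:
        # attention_mask: (batch, seq_len) with 1 = real token, 0 = padding.
        # Broadcast it over heads and query positions and block the padded keys.
        padding = attention_mask[:, None, None, :] == 0
        scaled_attention_logits = scaled_attention_logits.masked_fill(padding, neg_inf)

    attention_weights = torch.softmax(scaled_attention_logits, dim=-1)
    if dropout is not None:
        attention_weights = dropout(attention_weights)

    output = torch.matmul(attention_weights, v)
    return output, attention_weights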


class MultiHeadAttention(nn.Module):
    def __init__(self, config):
        super(MultiHeadAttention, self).__init__()
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.dropout = nn.Dropout(config.attn_pdrop)
        assert self.n_embd % self.n_head == 0

        self.wq = nn.Linear(self.n_embd, self.n_embd)
        self.wk = nn.Linear(self.n_embd, self.n_embd)
        self.wv = nn.Linear(self.n_embd, self.n_embd)
        self.dense = nn.Linear(self.n_embd, self.n_embd)

    def split_heads(self, x, batch_size):
        x = x.view(batch_size, -1, self.n_head, self.n_embd // self.n_head)
        return x.permute(0, 2, 1, 3)

    def forward(self, v, k, q, mask):
        batch_size = q.size(0)

        q = self.wq(q)
        k = self.wk(k)
        v = self.wv(v)

        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)

        scaled_attention, _ = scaled_dot_product_attention(q, k, v, mask, self.dropout)
        scaled_attention = scaled_attention.permute(0, 2, 1, 3)

        concat_attention = scaled_attention.contiguous().view(batch_size, -1, self.n_embd)
        output = self.dense(concat_attention)
        return output


class TransformerBlock(nn.Module):
    def __init__(self, config):
        super(TransformerBlock, self).__init__()
        self.attention = MultiHeadAttention(config)
        self.norm1 = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)
        self.norm2 = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)
        self.ffn = nn.Sequential(
            nn.Linear(config.n_embd, config.n_embd * 4),
            nn.GELU(approximate="tanh"),  # tanh form, i.e. "gelu_new" (needs PyTorch >= 1.12)
            nn.Linear(config.n_embd * 4, config.n_embd),
            nn.Dropout(config.resid_pdrop),
        )
        self.dropout = nn.Dropout(config.resid_pdrop)

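    def forward(self, x, mask):
        # GPT-2 is pre-LayerNorm: normalize the input of each sub-layer and add
        # the sub-layer output back onto the un-normalized residual stream.
        normed = self.norm1(x)
        attn_output = self.attention(normed, normed, normed, mask)
        x = x + self.dropout(attn_output)

        # self.ffn already ends in Dropout(resid_pdrop), so it is not applied again.
        x = x + self.ffn(self.norm2(x))
        return x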


class GPT2(nn.Module):
    def __init__(self, config):
        super(GPT2, self).__init__()
        # Keep the config on the module: _init_weights reads it via self.config.
        self.config = config
        self.wte = nn.Embedding(config.vocab_size, config.n_embd)
        self.wpe = nn.Embedding(config.n_positions, config.n_embd)
        self.drop = nn.Dropout(config.embd_pdrop)
        self.h = nn.ModuleList([TransformerBlock(config) for _ in range(config.n_layer)])
        self.ln_f = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)
        self.fc = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, (nn.Linear, nn.Embedding)):
            # `config` is not in scope here, so use the copy stored in __init__.
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)

    def forward(self, input_ids, attention_mask=None):
        position_ids = torch.arange(0, input_ids.size(-1), dtype=torch.long, device=input_ids.device)
        position_ids = position_ids.unsqueeze(0).expand_as(input_ids)

        input_embeds = self.wte(input_ids)
        position_embeds = self.wpe(position_ids)

        hidden_states = input_embeds + position_embeds
        hidden_states = self.drop(hidden_states)

        for block in self.h:
            hidden_states = block(hidden_states, attention_mask)

        hidden_states = self.ln_f(hidden_states)
        logits = self.fc(hidden_states)

        return logits
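
The masking step of the attention computation can be checked in isolation. This is a minimal sketch, assuming the convention of the all-ones `attention_mask` in the test cell (1 = real token, 0 = padding) and adding the causal constraint that GPT-2 relies on; `masked_attention_weights` is a hypothetical helper written for this check, not part of the model code.

```python
import torch

def masked_attention_weights(logits, attention_mask):
    """logits: (batch, n_head, seq, seq); attention_mask: (batch, seq), 1 = keep."""
    seq_len = logits.size(-1)
    neg_inf = torch.finfo(logits.dtype).min
    # Causal part: query position i may only see key positions <= i.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    logits = logits.masked_fill(~causal, neg_inf)
    # Padding part: broadcast (batch, seq) over heads and query positions.
    padding = attention_mask[:, None, None, :] == 0
    logits = logits.masked_fill(padding, neg_inf)
    return torch.softmax(logits, dim=-1)

logits = torch.zeros(1, 1, 4, 4)            # uniform scores, so the masks decide everything
mask = torch.tensor([[1, 1, 1, 0]])         # the last token is padding
weights = masked_attention_weights(logits, mask)
print(weights[0, 0])
```

With uniform scores, each row spreads its weight evenly over the positions it is allowed to see, and the padded last column receives zero weight in every row.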

In [30]:
# model = GPT2(GPT2Config())
# model.load_state_dict(torch.load("gpt2_weights.pth"))

# input_ids = torch.tensor([[50256, 345, 262, 263, 1818, 287, 262, 2635, 11, 290, 262, 1898, 287, 7526, 11, 423, 262, 2635, 13, 198, 198, 198, 40, 858, 262, 1578, 764, 11, 290, 262, 1898, 287, 7526, 764]])
# attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

# with torch.no_grad():
#     outputs = model(input_ids, attention_mask=attention_mask)

In [34]:
# from transformers import GPT2Config

# # Get the configuration of the pre-trained model
# config = GPT2Config.from_pretrained('gpt2')
# print(config)

# #(self, vocab_size=50257, n_positions=1024, n_ctx=1024, n_embd=768,
#                 #  n_layer=12, n_head=12, layer_norm_epsilon=1e-05, initializer_range=0.02,
#                 #  embd_pdrop=0.1, resid_pdrop=0.1):

**TASK 2**
