# Job Interview Guide — Workshop

*Generated on 2025-10-17 17:47:07*

This workshop has two parts:
1. A transcript with the prompts and the full quiz session.
2. Hands-on exercises for regression metrics and model evaluation.


## Transcript

This section records the prompts and the full quiz session.

### Initial task prompt from user

You are a **senior Data Scientist and Machine Learning Engineer** who also acts as a **technical interviewer**.  
I am a **junior Data Scientist and Machine Learning Engineer**, recently graduated and preparing for an **ML Specialist interview**.

You will guide me through an interactive study and evaluation process.

Two files are attached:

* **StudyGuide.txt** – contains the study guides.
* **StudyMaterials.zip** – contains the HTML version Jupyter Notebook workshops.

Your tasks:

1. **Analyze the materials**

   * Unzip and review the Jupyter Notebooks.
   * Create a **500-word summary** explaining the main learning content and technical focus.

2. **Cross-reference with the study guides**

   * Unzip and review the Study Guide.
   * Match each study guide topic with related content in the Jupyter Notebooks.
   * Produce a **comparison table** showing whether each topic is covered or missing.

3. **Create a quiz**

   * Write **15 multiple-choice questions** (A–E) based on the combined knowledge from both sources.
   * Each question should test understanding relevant to a real-world ML interview.

4. **Interactive quiz mode**

   * Do **not** produce the quiz immediately.
   * Instead, start a quiz session and ask **one question at a time**.
   * I will answer each question.
   * Track my score.
   * After the quiz ends, show my **total score** and generate a **Jupyter Notebook with exercises** targeting my weak areas.


### User: Start quiz

**User:** Yes, start the quiz session now



### Quiz session

**Q1.** Purpose of the validation dataset?  
A. Train the model by updating parameters  
B. Evaluate the model on unseen data for the first time  
C. Tune hyperparameters and choose the best model configuration  
D. Measure model drift after deployment  
E. Store outliers removed from training data  

**User answer:** B  
**Correct:** C  
**Explanation:** Validation is for model selection and hyperparameter tuning. The test set is for first unbiased evaluation.

---

**Q2.** Best description of a non-parametric model?  
A. Fixed number of parameters independent of data size  
B. Assumes predefined functional form  
C. Learns patterns directly from data without a fixed structure  
D. Always faster to train  
E. Always interpretable and less prone to overfitting  

**User answer:** C  
**Correct:** C  
**Explanation:** Non-parametric models adapt to data without a fixed functional form.

---

**Q3.** Role of learning rate in Gradient Descent?  
A. Determines epochs  
B. Controls step size for parameter updates  
C. Defines number of parameters  
D. Measures curvature of loss  
E. Acts as regularizer  

**User answer:** B  
**Correct:** B  
**Explanation:** Step size per update.

---

**Q4.** Appropriate loss for classification?  
A. MSE  
B. MAE  
C. Cross-Entropy  
D. Huber  
E. MAPE  

**User answer:** C  
**Correct:** C  
**Explanation:** Cross-Entropy penalizes confident wrong predictions.

---

**Q5.** High R² means?  
A. Low variance and overfitting  
B. Model explains most target variance  
C. Poor generalization  
D. Residuals are uncorrelated  
E. Loss reached a local minimum  

**User answer:** A  
**Correct:** B  
**Explanation:** R² measures variance explained. It does not by itself indicate overfitting.

---

**Q6.** OLS is:  
A. Iterative gradient update algorithm  
B. Closed-form method minimizing squared errors  
C. Probabilistic model for binary classification  
D. Minimizes absolute deviations  
E. Cross-validation strategy  

**User answer:** B  
**Correct:** B

---

**Q7.** Key advantage of parametric models:  
A. Always outperform non-parametric models  
B. No assumptions about data  
C. Simple, interpretable, and efficient  
D. Automatically adapt to nonlinear data  
E. Avoid underfitting regardless of complexity  

**User answer:** C  
**Correct:** C

---

**Q8.** Purpose of k-fold CV repeats:  
A. Reduce bias by growing training set  
B. Ensure all data points are used for training and validation  
C. Find optimal learning rate automatically  
D. Prevent fast convergence  
E. Balance classes during training  

**User answer:** B  
**Correct:** B

---

**Q9.** Main goal of optimization:  
A. Select features automatically  
B. Find parameters that minimize loss  
C. Normalize inputs  
D. Compute confidence intervals  
E. Improve interpretability via regularization  

**User answer:** B  
**Correct:** B

---

**Q10.** Most robust to outliers in regression:  
A. MSE  
B. MAE  
C. RMSE  
D. R²  
E. MAPE  

**User answer:** A  
**Correct:** B  
**Explanation:** MAE does not square errors, so outliers have less impact.

---

**Q11.** Primary limitation of Gradient Descent vs OLS:  
A. Cannot handle large datasets  
B. Requires iterative updates and may converge slowly  
C. Cannot apply to nonlinear models  
D. Does not compute gradients  
E. Non-deterministic results  

**User answer:** B  
**Correct:** B

---

**Q12.** Role of sigmoid in Logistic Regression:  
A. Create polynomial features  
B. Normalize features  
C. Map real numbers to probabilities in [0,1]  
D. Measure loss  
E. Balance classes  

**User answer:** C  
**Correct:** C

---

**Q13.** Early stopping means:  
A. Stop at 100% test accuracy  
B. Stop when validation performance stops improving  
C. Stop after fixed epochs  
D. Stop when training loss is lowest  
E. Stop when gradient is zero  

**User answer:** B  
**Correct:** B

---

**Q14.** True about R²:  
A. Measures average absolute error  
B. Increases as irrelevant features are added  
C. Compares model vs predicting the mean  
D. Only for classification  
E. Can never be negative  

**User answer:** C  
**Correct:** C

---

**Q15.** Self-supervised learning is:  
A. Learning with no feedback  
B. Learning with labeled data  
C. Predicting masked or missing parts of input as supervision  
D. Rewards or penalties for actions  
E. Reinforcement from human evaluators  

**User answer:** C  
**Correct:** C

**Final score:** 12 / 15


### Follow-up prompts

- **User:** Yes, proceed  
- **Assistant:** Generated targeted exercises notebook  
- **User:** Generate the notebook again. Call it "JobInterviewGuide_Workshop.ipynb". Include markdown with the contents of the entire quiz session and all the prompts I issued herein.


## Workshop Exercises: Regression Metrics and Model Evaluation

You will practice:
- MAE vs MSE robustness to outliers.
- R² by formula and with a mean baseline.
- Proper train, validation, and test flow.
- Validation curve for polynomial degree.
- 10-fold cross-validation stability.


### Environment setup

Install if needed:
```
pip install numpy pandas scikit-learn matplotlib
```


```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

np.random.seed(42)
```


### Data generation with injected outliers

```python
# Base linear relationship
n = 300
X = np.linspace(-4, 4, n).reshape(-1, 1)
y_true = 3.0 * X.squeeze() + 5.0
noise = np.random.normal(0, 1.0, size=n)
y = y_true + noise

# Inject outliers
outlier_idx = np.random.choice(np.arange(n), size=6, replace=False)
y[outlier_idx] += np.random.choice([15, -15, 20, -20], size=6)

# DataFrame
df = pd.DataFrame({"x": X.squeeze(), "y": y, "is_outlier": 0})
df.loc[outlier_idx, "is_outlier"] = 1

df.head()
```


### Plot: data with outliers highlighted

```python
plt.figure()
plt.scatter(df["x"], df["y"], s=12, label="data")
plt.scatter(df.loc[df.is_outlier==1, "x"], df.loc[df.is_outlier==1, "y"], s=40, marker="x", label="outliers")
plt.title("Synthetic data with injected outliers")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```


### Exercise 1. MAE vs MSE robustness to outliers

```python
# Fit model on all data
X_all, y_all = df[["x"]].values, df["y"].values
model = LinearRegression()
model.fit(X_all, y_all)
y_pred_all = model.predict(X_all)

mae_all = mean_absolute_error(y_all, y_pred_all)
mse_all = mean_squared_error(y_all, y_pred_all)
# RMSE as sqrt(MSE): the `squared=False` argument is deprecated in recent scikit-learn
rmse_all = np.sqrt(mse_all)

# Remove outliers and recompute
mask = df["is_outlier"] == 0
X_clean, y_clean = df.loc[mask, ["x"]].values, df.loc[mask, "y"].values
model_clean = LinearRegression().fit(X_clean, y_clean)
y_pred_clean = model_clean.predict(X_clean)

mae_clean = mean_absolute_error(y_clean, y_pred_clean)
mse_clean = mean_squared_error(y_clean, y_pred_clean)
rmse_clean = np.sqrt(mse_clean)

summary = pd.DataFrame({
    "metric": ["MAE", "MSE", "RMSE"],
    "with_outliers": [mae_all, mse_all, rmse_all],
    "without_outliers": [mae_clean, mse_clean, rmse_clean]
})
# Relative drop in each metric once the outliers are removed
summary["pct_change_when_removing_outliers"] = 100 * (summary["with_outliers"] - summary["without_outliers"]) / summary["with_outliers"]
summary.round(3)
```


### Plot residuals: with and without outliers

```python
res_all = y_all - y_pred_all
res_clean = y_clean - y_pred_clean

plt.figure()
plt.hist(res_all, bins=30)
plt.title("Residuals with outliers")
plt.xlabel("residual")
plt.ylabel("count")
plt.show()

plt.figure()
plt.hist(res_clean, bins=30)
plt.title("Residuals without outliers")
plt.xlabel("residual")
plt.ylabel("count")
plt.show()
```


**Guiding questions**
- Which metric changed the most after removing outliers: MAE, MSE, or RMSE?
- Which metric is more robust to outliers, and why?
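
As a hand-checkable reference for the questions above, here is a minimal sketch on five made-up residuals (the values are hypothetical and not taken from the workshop data). It compares how much MAE and MSE move when a single large error is included.

```python
import numpy as np

# Hypothetical residuals: four small errors plus one large outlier.
residuals = np.array([1.0, -1.0, 1.0, -1.0, 20.0])
inliers = residuals[:-1]


def mae(r):
    """Mean absolute error of a residual vector."""
    return float(np.mean(np.abs(r)))


def mse(r):
    """Mean squared error of a residual vector."""
    return float(np.mean(r ** 2))


mae_without, mae_with = mae(inliers), mae(residuals)
mse_without, mse_with = mse(inliers), mse(residuals)

print(f"MAE: {mae_without:.1f} -> {mae_with:.1f}")  # MAE: 1.0 -> 4.8
print(f"MSE: {mse_without:.1f} -> {mse_with:.1f}")  # MSE: 1.0 -> 80.8
```

The outlier contributes 20 to the sum of absolute errors but 400 to the sum of squared errors, which is why the squared metric reacts so much more strongly.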


### Exercise 2. R² interpretation and manual computation

```python
# Split
X_train, X_test, y_train, y_test = train_test_split(df[["x"]].values, df["y"].values, test_size=0.25, random_state=7)

# Fit
lin = LinearRegression().fit(X_train, y_train)

# Library R²
yhat_test = lin.predict(X_test)
r2_lib = r2_score(y_test, yhat_test)

# Manual R²
ybar = np.mean(y_test)
ss_res = np.sum((y_test - yhat_test)**2)
ss_tot = np.sum((y_test - ybar)**2)
r2_manual = 1 - ss_res / ss_tot

# Baseline: mean predictor
yhat_base = np.full_like(y_test, fill_value=np.mean(y_train))
r2_base = r2_score(y_test, yhat_base)

pd.DataFrame({
    "metric": ["R2_library", "R2_manual", "R2_baseline_mean"],
    "value": [r2_lib, r2_manual, r2_base]
}).round(4)
```


### Plot: predictions vs ground truth on test set

```python
plt.figure()
plt.scatter(X_test.squeeze(), y_test, s=12, label="y_test")
plt.scatter(X_test.squeeze(), yhat_test, s=12, label="y_pred")
plt.title("Test set: true vs predicted")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```


**Guiding questions**
- Does the manual R² match the library value up to floating-point tolerance?
- What does the baseline R² tell you about the usefulness of the learned model?
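
To help read the baseline row, the sketch below uses four made-up target values (hypothetical, not the workshop data). It shows that a constant predictor scores exactly 0 when the constant equals the mean of the evaluated targets, and below 0 when it is any other constant, such as a training-set mean that differs from the test-set mean. The value `3.0` stands in for such a training mean and is an assumption.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical evaluation targets; their mean is 2.5.
y_eval = np.array([1.0, 2.0, 3.0, 4.0])

# Constant prediction equal to the mean of the evaluated targets.
pred_same_mean = np.full_like(y_eval, y_eval.mean())
r2_same_mean = r2_score(y_eval, pred_same_mean)

# Constant prediction from a different mean (assumed training mean of 3.0).
pred_other_mean = np.full_like(y_eval, 3.0)
r2_other_mean = r2_score(y_eval, pred_other_mean)

print(r2_same_mean)   # 0.0
print(r2_other_mean)  # approximately -0.2, since 1 - 6/5 = -0.2
```

So R² can be negative on held-out data, and a learned model is only useful to the extent that its R² clearly exceeds this baseline.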


### Exercise 3. Proper Train/Validation/Test flow with polynomial degree selection

```python
# Split train/val/test
X_train, X_temp, y_train, y_temp = train_test_split(df[["x"]].values, df["y"].values, test_size=0.35, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=1)

val_rows = []
best = {"deg": None, "mae": np.inf, "r2": -np.inf}

# Select the polynomial degree using validation MAE only
for deg in [1,2,3,4,5]:
    pipe = make_pipeline(PolynomialFeatures(deg, include_bias=False), LinearRegression())
    pipe.fit(X_train, y_train)
    yv = pipe.predict(X_val)
    mae = mean_absolute_error(y_val, yv)
    r2v = r2_score(y_val, yv)
    val_rows.append({"degree": deg, "val_MAE": mae, "val_R2": r2v})
    if mae < best["mae"]:
        best = {"deg": deg, "mae": mae, "r2": r2v}

val_table = pd.DataFrame(val_rows).sort_values("val_MAE").reset_index(drop=True)

# Refit best on train+val and evaluate on test once
X_trval = np.vstack([X_train, X_val])
y_trval = np.concatenate([y_train, y_val])
best_pipe = make_pipeline(PolynomialFeatures(best["deg"], include_bias=False), LinearRegression())
best_pipe.fit(X_trval, y_trval)
yt = best_pipe.predict(X_test)

final_mae = mean_absolute_error(y_test, yt)
final_mse = mean_squared_error(y_test, yt)
# RMSE as sqrt(MSE): the `squared=False` argument is deprecated in recent scikit-learn
final_rmse = np.sqrt(final_mse)
final_r2 = r2_score(y_test, yt)

val_table.round(4), {"best_degree": best["deg"], "test_MAE": final_mae, "test_MSE": final_mse, "test_RMSE": final_rmse, "test_R2": final_r2}
```


### Plot: validation curve by polynomial degree

```python
# val_table is sorted by val_MAE, so re-sort by degree before drawing the line
curve = val_table.sort_values("degree")
degrees = curve["degree"].values
maes = curve["val_MAE"].values

plt.figure()
plt.plot(degrees, maes, marker="o")
plt.title("Validation MAE vs Polynomial degree")
plt.xlabel("degree")
plt.ylabel("validation MAE")
plt.xticks(degrees)
plt.show()
```


**Guiding questions**
- Does a higher degree always improve validation MAE?
- Why should you refit on train+val before a single final test?


### Exercise 4. 10-fold cross-validation stability

```python
best_deg = int(val_table.sort_values("val_MAE").iloc[0]["degree"])
pipe_cv = make_pipeline(PolynomialFeatures(best_deg, include_bias=False), LinearRegression())

cv = KFold(n_splits=10, shuffle=True, random_state=123)
neg_mae_scores = cross_val_score(pipe_cv, df[["x"]].values, df["y"].values, cv=cv, scoring="neg_mean_absolute_error")
mae_scores = -neg_mae_scores

pd.DataFrame({
    "fold": np.arange(1, len(mae_scores)+1),
    "MAE": mae_scores
}).assign(mean=mae_scores.mean(), std=mae_scores.std())
```
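
The submission list asks for a CV stability plot, so here is one possible way to draw it: a bar per fold with the mean as a dashed line. This is a sketch under assumptions. It regenerates stand-in data (`X_demo`, `y_demo`, without injected outliers) and fits a plain `LinearRegression` so that it runs on its own; in the notebook you would pass `mae_scores` from the cell above instead of `fold_mae`.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Stand-in data (assumption): same linear relationship as the workshop data.
rng = np.random.default_rng(0)
X_demo = np.linspace(-4, 4, 300).reshape(-1, 1)
y_demo = 3.0 * X_demo.squeeze() + 5.0 + rng.normal(0, 1.0, size=300)

# Per-fold MAE from 10-fold CV; scikit-learn returns negated errors.
cv_demo = KFold(n_splits=10, shuffle=True, random_state=123)
fold_mae = -cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=cv_demo, scoring="neg_mean_absolute_error",
)

# One bar per fold, with the mean across folds as a reference line.
folds = np.arange(1, len(fold_mae) + 1)
fig, ax = plt.subplots()
ax.bar(folds, fold_mae)
ax.axhline(fold_mae.mean(), linestyle="--", color="black",
           label=f"mean = {fold_mae.mean():.3f}")
ax.set_title("10-fold CV: MAE per fold")
ax.set_xlabel("fold")
ax.set_ylabel("MAE")
ax.set_xticks(folds)
ax.legend()
plt.show()
```

Bars of similar height indicate that the estimate does not depend much on which rows land in the validation fold; a fold that stands out is worth inspecting for outliers.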


### Optional Exercise 5. Overfitting demo

- Try a high degree, such as 12 or 15, and repeat the process.
- Compare train MAE vs validation MAE to see overfitting.
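
A possible starting point for this exercise is sketched below. It is self-contained and uses its own small stand-in sample (`X_demo`, `y_demo`, 60 points, an assumption made so that high degrees have room to overfit); replace these with the workshop split when running it in the notebook. How large the gap between train and validation MAE turns out to be depends on sample size, noise, and the random seed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Stand-in data (assumption): linear signal with unit Gaussian noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-4, 4, size=(60, 1))
y_demo = 3.0 * X_demo.squeeze() + 5.0 + rng.normal(0, 1.0, size=60)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=1
)

# Fit each degree on the training half and score both halves.
rows = []
for deg in [1, 3, 12, 15]:
    pipe = make_pipeline(
        PolynomialFeatures(deg, include_bias=False), LinearRegression()
    )
    pipe.fit(X_tr, y_tr)
    rows.append({
        "degree": deg,
        "train_MAE": mean_absolute_error(y_tr, pipe.predict(X_tr)),
        "val_MAE": mean_absolute_error(y_va, pipe.predict(X_va)),
    })

for row in rows:
    print(f"degree={row['degree']:>2}  "
          f"train_MAE={row['train_MAE']:.3f}  val_MAE={row['val_MAE']:.3f}")
```

A validation MAE that rises while the training MAE stays low or keeps falling is the pattern to look for.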


## What to submit

- The executed notebook with answers to the guiding questions.
- Plots for the residuals, prediction vs truth, validation curve, and CV stability.
