# üß†üìö Job Interview Guide Workshop

![Image Description](./images/OnlineJobInterview2.png)



## 🎯 Learning Objectives
- Practice **AI-mediated interview preparation** with a mix of *technical* and *behavioral/scenario-based* questions.
- Reinforce core ML topics from the course through **targeted exercises**.
- Produce a **graded, personalized study notebook** tailored to quiz results.
- Demonstrate professionalism with **clear documentation** and **version control** (GitHub).



## 📘 Topics Covered (Review Scope)
1. **Supervised vs. Unsupervised** learning algorithms  
2. **Dependent vs. Independent Variables**  
3. **Train / Validation / Test Split** (data leakage, stratification)  
4. **Linear Regression**: residuals, linearization  
5. **Regression Analysis**: parametric vs. non-parametric, **R²**, **MSE**  
6. **Logistic Regression**: intercept, slope, **cross-entropy**  
7. **K-Nearest Neighbors (KNN)**: hyperparameters  
8. **Decision Trees**: leaf nodes & predictions  



## 🧭 Workflow
1. Collect workshop `.ipynb` files and study guide → create `StudyGuide.txt` and `StudyMaterials.zip`
2. Open a new LLM session and upload both files
3. Paste the interview prompt → complete the 15-question quiz
4. Record results → complete exercises → push to GitHub



## 📝 Copy-Paste Prompt for Your LLM Session

```text
You are a seasoned Data Scientist, Machine Learning Engineer, and technical interviewer.
I am a Data Scientist and ML Engineer, fresh out of college. You will interview me for an ML Specialist role.

1) Unzip and read StudyMaterials.zip. Produce a 500-word summary of the ML learning content and coding patterns.
2) Read StudyGuide.txt. Produce a 100-word summary of interview topics emphasized.
3) Match study guide topics to workshop materials. Create a table listing each topic, coverage, and gaps.
4) Create 15 multiple-choice questions (A-E) spanning all topics. Ask one at a time. Score me after all 15.
5) Based on wrong answers, generate a personalized JobInterviewGuide_Workshop.ipynb with targeted exercises.
Stop here and wait for my command to start the quiz.
```



## 🧮 Grading Rubric
| Component | Description | Weight |
|---|---|---|
| **Quiz Performance** | Score across 15 technical + scenario/behavioral questions | **40%** |
| **Generated Study Notebook Quality** | Accuracy, completeness, clarity, and relevance | **40%** |
| **Reflection** | Insightful, concise self-assessment | **20%** |



---
## 🤖 LLM Mock Interview Session: 15-Question Quiz

> Full question-and-answer transcript from the LLM interview session conducted using **Claude (Anthropic)**.



### Quiz Transcript

---

**Q1. [Supervised Learning]**  
A company wants to build a model to predict next month's sales revenue from historical data. Which type of learning best describes this task?

A) Reinforcement learning  
B) Unsupervised clustering  
C) Supervised regression  
D) Dimensionality reduction  
E) Semi-supervised classification  

**Answer given: C ✅ CORRECT**  
*Explanation: Predicting a continuous numeric output (sales revenue) from labeled historical data is a supervised regression task. The target variable is known and continuous.*

---

**Q2. [Unsupervised Learning]**  
Which algorithm is an example of unsupervised learning?

A) Decision Tree  
B) K-Nearest Neighbors  
C) Linear Regression  
D) K-Means Clustering  
E) Logistic Regression  

**Answer given: D ✅ CORRECT**  
*Explanation: K-Means discovers natural groupings in data without any labeled output. All other options are supervised: they require a labeled target variable during training.*

---

**Q3. [Independent vs. Dependent Variables]**  
You are building a model to predict blood pressure using a patient's age, weight, and salt intake. Which of the following is an independent variable?

A) Blood pressure  
B) Patient outcome  
C) Salt intake  
D) Hospital admission rate  
E) Model accuracy  

**Answer given: C ✅ CORRECT**  
*Explanation: Independent variables (features/inputs) are what we feed into the model. Blood pressure is the dependent variable (target). Salt intake, age, and weight are independent variables.*

---

**Q4. [Train/Val/Test Split]**  
You have 10,000 samples and want a 70/15/15 train/val/test split. What is the correct approach?

A) Split once into 70% train and 30% test; use train for validation  
B) Split into 70% train and 30% temp; then split temp 50/50 into val and test  
C) Use all data for training and evaluate with cross-validation only  
D) Randomly shuffle and take every 3rd sample for each set  
E) Use the first 70% of rows for train and last 30% for testing  

**Answer given: B ✅ CORRECT**  
*Explanation: The two-step split first sets aside 30% as a temporary holdout, then splits that holdout 50/50 to yield 15% validation and 15% test. The three sets are disjoint, so no sample is used for both training and evaluation.*

---

**Q5. [Data Leakage: Target Leakage]**  
You include a feature `days_since_last_payment` in a loan default prediction model. After deployment the model performs much worse than in testing. What likely happened?

A) The model had too many features  
B) The test set was too small  
C) The feature is a result of the default event: it is only available after the label is known  
D) The model overfitted to noise  
E) K was set too high in KNN  

**Answer given: D ❌ INCORRECT** *(Correct answer: C)*  
*Explanation: This is target leakage: the feature `days_since_last_payment` is causally downstream of the default event itself. During training the model learned a shortcut that does not exist at prediction time, producing inflated test performance and collapsed deployment performance.*

---

**Q6. [Linear Regression: MSE & R²]**  
Two linear regression models are trained on the same dataset. Model A has R² = 0.88, MSE = 12.4. Model B has R² = 0.76, MSE = 24.1. Which model should you prefer and why?

A) Model B: lower R² means less overfitting  
B) Model A: higher R² and lower MSE indicate better fit and smaller prediction errors  
C) Model B: higher MSE means the model is more conservative  
D) They are equivalent: R² and MSE always agree  
E) Model A, but only if the dataset has fewer than 1,000 samples  

**Answer given: B ✅ CORRECT**  
*Explanation: R² closer to 1 means more variance explained; lower MSE means smaller squared errors. Model A is better on both metrics, so prefer it.*
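
Study note: the link between the two metrics can be checked by hand. On a fixed evaluation set, R² = 1 − MSE / Var(y), so the model with the lower MSE also has the higher R². The sketch below uses made-up numbers (`y_true`, `pred_a`, and `pred_b` are illustrative and not taken from the workshop data).

```python
# Illustrative values only; not taken from the workshop data.
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
pred_a = [3.2, 4.7, 7.1, 9.4, 10.6]   # closer predictions
pred_b = [4.0, 4.0, 8.0, 8.0, 12.0]   # rougher predictions

def mse(y, p):
    """Mean squared error."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)

def r2(y, p):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, p))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(f"Model {name}: MSE = {mse(y_true, pred):.3f}, R^2 = {r2(y_true, pred):.3f}")
```

Because both metrics are built from the same squared residuals, they rank models identically on the same data; they can disagree only when computed on different datasets.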

---

**Q7. [Residuals]**  
After fitting a linear regression, you plot residuals vs. fitted values and notice the spread of residuals increases as fitted values increase (funnel shape). This indicates:

A) The model is underfitting  
B) There are too many features  
C) Heteroscedasticity: the variance of errors is not constant  
D) The target variable is normally distributed  
E) The model is perfectly calibrated  

**Answer given: C ✅ CORRECT**  
*Explanation: A funnel-shaped residual plot is the signature of heteroscedasticity: the OLS assumption of constant error variance (homoscedasticity) is violated. Common remedies include log-transforming the target or using weighted least squares.*

---

**Q8. [Logistic Regression: Intercept]**  
In a logistic regression model, the intercept β₀ represents:

A) The slope of the decision boundary  
B) The model's accuracy when all features equal zero  
C) The log-odds of the positive class when all independent variables equal zero  
D) The total variance explained by the model  
E) The cross-entropy loss at convergence  

**Answer given: B ❌ INCORRECT** *(Correct answer: C)*  
*Explanation: β₀ is the log-odds (not probability or accuracy) of the positive class when every feature is zero. To convert to probability: P = sigmoid(β₀) = 1/(1+exp(-β₀)). Accuracy is a post-hoc classification metric, not part of the model equation.*

---

**Q9. [Logistic Regression: Cross-Entropy]**  
Cross-entropy loss for binary classification is computed as:

A) Sum of squared differences between predicted and actual values  
B) Mean absolute error between probabilities and labels  
C) −[y·log(p) + (1−y)·log(1−p)] averaged over all samples  
D) R² subtracted from 1  
E) The number of misclassified samples divided by total samples  

**Answer given: C ✅ CORRECT**  
*Explanation: Binary cross-entropy (log loss) formula is −[y·log(p) + (1−y)·log(1−p)]. When y=1, only −log(p) matters. When y=0, only −log(1−p) matters. This heavily penalizes confident wrong predictions.*
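
Study note: the formula is short enough to compute by hand. The sketch below is a plain-Python version. The labels and probabilities are made up, and the `eps` clipping is an assumption added here to avoid `log(0)`.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)]; eps keeps p away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Made-up labels and predicted probabilities of the positive class.
labels = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.9, 0.1]
confident_wrong = [0.1, 0.9, 0.1, 0.9]

print(f"confident and right: {binary_cross_entropy(labels, confident_right):.4f}")
print(f"confident and wrong: {binary_cross_entropy(labels, confident_wrong):.4f}")
```

The second loss is much larger than the first, which is the "heavy penalty for confident wrong predictions" in numeric form.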

---

**Q10. [KNN: Effect of k]**  
You train a KNN classifier with k=1 and observe 100% training accuracy but 61% test accuracy. What is the most likely cause?

A) The features are not normalized  
B) The dataset is too large for KNN  
C) k=1 causes the model to overfit: it memorizes each training point  
D) The distance metric is incorrect  
E) The class labels are imbalanced  

**Answer given: C ✅ CORRECT**  
*Explanation: With k=1, every training point is its own nearest neighbor, so training accuracy is 100% (barring duplicate points with conflicting labels). The decision boundary is jagged and overfits noise, causing poor generalization.*

---

**Q11. [KNN: Feature Scaling]**  
Why must features be standardized or normalized before applying KNN?

A) KNN cannot handle categorical variables without scaling  
B) Scaling reduces the number of neighbors to compute  
C) Features with larger numeric ranges will dominate distance calculations, biasing the model  
D) Scaling converts the problem from regression to classification  
E) KNN requires all features to be binary  

**Answer given: C ✅ CORRECT**  
*Explanation: A feature ranging 0–10,000 (e.g., income) will overwhelm one ranging 0–1 (e.g., min-max-normalized age) in Euclidean distance, making the latter nearly irrelevant. Scaling equalizes feature contributions.*
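
Study note: a small numeric sketch of the effect. The three points and the feature ranges passed to `min_max_scale` are hypothetical values chosen for illustration. In the workshop notebooks the scaling is done with scikit-learn's `StandardScaler` inside a pipeline instead.

```python
import math

# Hypothetical customers: (income in dollars, age as a 0-1 fraction).
query          = (50_000.0, 0.20)
near_in_age    = (52_000.0, 0.21)   # similar age, income differs by 2,000
near_in_income = (50_100.0, 0.90)   # very different age, income differs by 100

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_scale(point, mins, maxs):
    """Rescale each feature to [0, 1] using the given (assumed) feature ranges."""
    return tuple((v - lo) / (hi - lo) for v, lo, hi in zip(point, mins, maxs))

mins, maxs = (0.0, 0.0), (100_000.0, 1.0)   # assumed ranges for illustration

raw_a = euclidean(query, near_in_age)
raw_b = euclidean(query, near_in_income)

q, a, b = (min_max_scale(p, mins, maxs) for p in (query, near_in_age, near_in_income))
scaled_a = euclidean(q, a)
scaled_b = euclidean(q, b)

print(f"raw:    near_in_age={raw_a:.3f}  near_in_income={raw_b:.3f}")
print(f"scaled: near_in_age={scaled_a:.3f}  near_in_income={scaled_b:.3f}")
```

On raw values the nearest neighbor is decided almost entirely by income; after scaling, the large age difference counts and the nearest neighbor flips.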

---

**Q12. [Decision Trees: Splitting Criteria]**  
A decision tree uses Gini impurity to decide splits. A node has Gini = 0.0. What does this mean?

A) The node splits the data into equal halves  
B) The node is the root of the tree  
C) The node contains samples from all classes equally  
D) The node is perfectly pure: all samples belong to the same class  
E) The tree has reached maximum depth  

**Answer given: D ✅ CORRECT**  
*Explanation: Gini impurity = 0 means all samples at that node share the same class label, i.e. perfect purity. The algorithm stops splitting at pure nodes (they become leaf nodes).*
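
Study note: Gini impurity is one minus the sum of squared class proportions at a node. A minimal sketch with made-up label lists:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["a", "a", "a", "a"]))   # pure node → 0.0
print(gini(["a", "a", "b", "b"]))   # evenly mixed, two classes → 0.5
print(gini(["a", "b", "c"]))        # evenly mixed, three classes (about 0.667)
```

For a two-class node the impurity peaks at 0.5 when the classes are balanced and falls to 0.0 as the node becomes pure.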

---

**Q13. [Decision Trees: Prediction]**  
During inference, how does a decision tree classify a new sample?

A) It averages predictions from all nodes  
B) It applies a sigmoid to the root node output  
C) It routes the sample through splits from root to a leaf, then outputs the leaf's majority class  
D) It computes Euclidean distance to the nearest training sample  
E) It calculates cross-entropy across all branches  

**Answer given: C ✅ CORRECT**  
*Explanation: Prediction traverses the tree top-down: at each internal node, the sample takes the branch that satisfies the split condition. When it reaches a leaf node, the leaf outputs its stored majority class label.*
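
Study note: the traversal can be sketched with a hand-built tree stored as nested dictionaries. The feature names, thresholds, and class labels below are invented for illustration, and the "go left when value <= threshold" rule is the convention assumed by this sketch.

```python
# A tiny hand-built tree; feature names and thresholds are made up for illustration.
tree = {
    "feature": "petal_length", "threshold": 2.5,
    "left":  {"leaf": "setosa"},
    "right": {
        "feature": "petal_width", "threshold": 1.7,
        "left":  {"leaf": "versicolor"},
        "right": {"leaf": "virginica"},
    },
}

def predict(node, sample):
    """Walk from the root to a leaf; go left when the feature value <= threshold."""
    while "leaf" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]   # the class label stored at the leaf

print(predict(tree, {"petal_length": 1.4, "petal_width": 0.2}))   # → setosa
print(predict(tree, {"petal_length": 5.1, "petal_width": 2.0}))   # → virginica
```

Only the splits along one root-to-leaf path are evaluated, so prediction cost grows with tree depth rather than with the number of training samples.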

---

**Q14. [Scenario: Choosing Between Models]**  
A hospital needs to predict patient readmission risk. The model must have both high recall (catch most at-risk patients) and explainability for medical staff. Which approach is best?

A) Deep neural network: highest accuracy  
B) KNN: straightforward prediction logic  
C) Random Forest with SHAP values for explainability  
D) Logistic Regression: interpretable coefficients and probability outputs  
E) K-Means clustering: groups high-risk patients automatically  

**Answer given: A ❌ INCORRECT** *(Correct answer: D)*  
*Explanation: Logistic regression provides probability estimates (so the decision threshold can be lowered to raise recall) and directly interpretable coefficients that clinicians can reason about, which also makes it easier to justify in regulated medical settings. Deep neural networks are opaque and a poor fit when explainability is required. Random Forest with SHAP is a reasonable runner-up, but logistic regression is the safest and most practical first choice here.*
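
Study note: the "lower the threshold to raise recall" step can be illustrated with a few made-up probabilities. The labels, probabilities, and the two thresholds below are hypothetical and do not come from any real readmission model.

```python
# Made-up predicted readmission probabilities and true labels (1 = readmitted).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.92, 0.61, 0.44, 0.31, 0.52, 0.41, 0.35, 0.15, 0.08, 0.05]

def recall_precision(y_true, y_prob, threshold):
    """Predict positive when probability >= threshold, then compute recall and precision."""
    preds = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 1)
    fn = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 0)
    fp = sum(1 for y, yhat in zip(y_true, preds) if y == 0 and yhat == 1)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

for threshold in (0.5, 0.3):
    rec, prec = recall_precision(y_true, y_prob, threshold)
    print(f"threshold={threshold}: recall={rec:.2f}, precision={prec:.2f}")
```

Lowering the threshold catches more true positives at the cost of more false alarms, which is often an acceptable trade when missing an at-risk patient is the costlier error.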

---

**Q15. [Scenario: Handling Overfitting]**  
Your decision tree achieves 99% training accuracy but only 65% on the validation set. Which option pairs the two actions that would most effectively address this? (Choose the single best answer.)

A) Increase the number of features and remove the validation set  
B) Set `max_depth` and `min_samples_leaf` to constrain tree complexity  
C) Use a larger learning rate and add more trees  
D) Replace the decision tree with linear regression  
E) Reduce the training set size to 50%  

**Answer given: B ✅ CORRECT**  
*Explanation: `max_depth` limits how deep the tree grows (prevents memorization). `min_samples_leaf` requires a minimum number of samples at each leaf (avoids splits on noise). Together they are the two most direct regularization tools for decision trees.*

---

### 🏆 Final Score: **12 / 15 (80%)**

| # | Topic | Result |
|---|---|---|
| 1 | Supervised vs. Unsupervised (regression) | ✅ |
| 2 | Unsupervised Learning (K-Means) | ✅ |
| 3 | Independent vs. Dependent Variables | ✅ |
| 4 | Train / Val / Test Split procedure | ✅ |
| 5 | Data Leakage: Target Leakage | ❌ |
| 6 | R² and MSE comparison | ✅ |
| 7 | Residuals: Heteroscedasticity | ✅ |
| 8 | Logistic Regression Intercept (β₀) | ❌ |
| 9 | Cross-Entropy Formula | ✅ |
| 10 | KNN: k=1 Overfitting | ✅ |
| 11 | KNN: Feature Scaling | ✅ |
| 12 | Decision Tree: Gini Impurity | ✅ |
| 13 | Decision Tree: Prediction Traversal | ✅ |
| 14 | Scenario: Model Selection (Hospital) | ❌ |
| 15 | Scenario: Overfitting Decision Tree | ✅ |

**Topics to review:** Target Leakage (Q5), Logistic Regression Intercept β₀ (Q8), Model Selection for Explainability (Q14)



## ✅ Record Your Results


In [None]:

quiz_results = {
    "name": "Kevinkumar Patel",
    "date": "2026-02-27",
    "model_used": "Claude (Anthropic)",
    "overall_score": 80,
    "num_correct": 12,
    "topics_missed": [
        "Q5  - Target Leakage: Diagnosed as overfitting instead of recognizing a causally downstream feature",
        "Q8  - Logistic Regression Intercept: Confused log-odds with accuracy/probability when all features = 0",
        "Q14 - Model Selection: Chose deep neural network over logistic regression for an explainability-required task"
    ],
    "behavioral_notes": (
        "Good systematic reasoning on split procedure (Q4) and tree regularization (Q15). "
        "Struggled to identify subtle forms of data leakage and to apply explainability constraints "
        "when selecting models in clinical scenarios."
    ),
    "next_steps_from_llm": (
        "1) Study target leakage with concrete examples: features derived from or caused by the label. "
        "2) Review logistic regression math from scratch: intercept is log-odds at X=0, convert via sigmoid. "
        "3) Build a model selection decision framework: when accuracy vs explainability vs recall trade-offs apply. "
        "4) Practice 2 scenario-based questions daily focusing on stakeholder constraints."
    )
}

print('Quiz results recorded.')
print(f"Score: {quiz_results['num_correct']}/15 ({quiz_results['overall_score']}%)")
print(f"Topics missed ({len(quiz_results['topics_missed'])}):")
for t in quiz_results['topics_missed']:
    print(f"  • {t}")



## 🗂 Coverage vs. Gaps (Generated by LLM)

| Topic | Covered in Workshop Materials | Notes / Gaps |
|---|---|---|
| Supervised Learning (regression & classification) | ✅ Yes | Examples in LinearRegressionArchitecture and LogisticRegressionClassifier |
| Unsupervised Learning (clustering) | ❌ Gap | K-Means or clustering not present in any workshop notebook |
| Dependent vs. Independent Variables | ✅ Yes | Clearly defined in regression workshops |
| Train / Val / Test Split Procedure | ✅ Yes | Implemented in PerformanceMetricsClassification |
| Target Leakage | ❌ Gap | Only preprocessing leakage is hinted at; feature-level (target) leakage not addressed |
| Data Leakage via Preprocessing | ⚠️ Partial | Splitting before scaling is shown but not highlighted as an explicit lesson |
| R² and MSE Interpretation | ✅ Yes | Computed and discussed in LinearRegressionArchitecture_Workshop |
| Heteroscedasticity / Residual Diagnostics | ⚠️ Partial | Residuals are plotted but heteroscedasticity is not named or diagnosed |
| Logistic Regression: Intercept (β₀) | ⚠️ Partial | Intercept printed in output but its meaning in log-odds terms is not explained |
| Logistic Regression: Cross-Entropy Formula | ✅ Yes | log_loss used and reported in LogisticRegressionClassifier |
| KNN Hyperparameters (k, distance, scaling) | ✅ Yes | Fully explored in KNearestNeighbors_Workshop |
| Decision Tree: Gini Impurity & Splitting | ⚠️ Partial | Tree is built and evaluated but splitting criteria not explained |
| Decision Tree: Prediction Traversal | ✅ Yes | plot_tree and predictions demonstrated |
| Model Selection Trade-offs (explainability, recall) | ❌ Gap | Models are evaluated in isolation; no comparative decision framework |
| Scenario-Based / Behavioral Questions | ❌ Gap | No interview simulation or scenario practice in the materials |



## 🧪 Targeted Practice

### 1) Train / Validation / Test Split & Target Leakage 🔪

Implement a correct stratified split and demonstrate **target leakage**, where a feature that is a consequence of the label inflates train, validation, and test scores alike, then fails at deployment because it is unavailable at prediction time.


In [None]:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report

np.random.seed(42)
n = 2000

# Simulate a loan default dataset
income        = np.random.normal(50000, 15000, n).clip(10000, 120000)
credit_score  = np.random.randint(300, 850, n)
loan_amount   = np.random.normal(15000, 5000, n).clip(1000, 50000)
default       = ((income < 35000) | (credit_score < 500)).astype(int)
default       = np.where(np.random.rand(n) < 0.05, 1 - default, default)  # 5% noise

# Target-leaking feature: days_since_last_payment is only known AFTER default occurs
days_since_last_payment = np.where(default == 1,
                                   np.random.randint(60, 180, n),   # defaulters miss payments
                                   np.random.randint(1,  30,  n))   # non-defaulters pay on time

df = pd.DataFrame({
    'income': income,
    'credit_score': credit_score,
    'loan_amount': loan_amount,
    'days_since_last_payment': days_since_last_payment,  # ← LEAKING FEATURE
    'default': default
})

print(f"Dataset: {df.shape} | Default rate: {df['default'].mean():.2%}")

# ── Stratified 70 / 15 / 15 split ────────────────────────────────────────
features_safe  = ['income', 'credit_score', 'loan_amount']
features_leaky = ['income', 'credit_score', 'loan_amount', 'days_since_last_payment']
y = df['default']

def make_splits(df, feature_cols):
    X = df[feature_cols]
    X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.30,
                                                 stratify=y, random_state=42)
    X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.50,
                                                 stratify=y_tmp, random_state=42)
    return X_tr, X_val, X_te, y_tr, y_val, y_te

# ✅ Correct model: no leaking feature
X_tr, X_val, X_te, y_tr, y_val, y_te = make_splits(df, features_safe)
model_safe = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
print("\n✅ Safe features only:")
print(f"   Train acc: {accuracy_score(y_tr,  model_safe.predict(X_tr)) :.4f}")
print(f"   Val   acc: {accuracy_score(y_val, model_safe.predict(X_val)):.4f}")
print(f"   Test  acc: {accuracy_score(y_te,  model_safe.predict(X_te)) :.4f}")

# ❌ Leaky model: includes days_since_last_payment
X_tr2, X_val2, X_te2, y_tr2, y_val2, y_te2 = make_splits(df, features_leaky)
model_leaky = GradientBoostingClassifier(random_state=42).fit(X_tr2, y_tr2)
print("\n❌ With TARGET-LEAKING feature (days_since_last_payment):")
print(f"   Train acc: {accuracy_score(y_tr2,  model_leaky.predict(X_tr2)) :.4f}  ← inflated!")
print(f"   Val   acc: {accuracy_score(y_val2, model_leaky.predict(X_val2)):.4f}  ← inflated!")
print(f"   Test  acc: {accuracy_score(y_te2,  model_leaky.predict(X_te2)) :.4f}  ← inflated!")
print("\n⚠️  In production, days_since_last_payment is not yet known at prediction time.")
print("   The model would fail silently. This is target leakage.")



### 2) Linear Regression: Residuals & Heteroscedasticity 📊

Fit linear regression on a dataset with heteroscedastic errors. Diagnose using residual plots and apply a **log-transform** to fix the violation.


In [None]:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from matplotlib import pyplot as plt

np.random.seed(7)

# Generate heteroscedastic data: variance grows with X
X_raw = np.linspace(1, 10, 400)
noise  = np.random.normal(0, X_raw * 1.5)   # std ∝ X  → heteroscedastic
y_raw  = 3.5 * X_raw + noise
X_2d   = X_raw.reshape(-1, 1)

# ── Model A: Linear on raw data (violated assumption) ────────────────────
linA = LinearRegression().fit(X_2d, y_raw)
predA = linA.predict(X_2d)
residA = y_raw - predA

# ── Model B: Linear on log(y), a variance-stabilizing transform ──────────
# log is undefined for y <= 0, so any non-positive draws are clipped to 0.01;
# those points appear as low outliers (log(0.01) ≈ -4.6) in the Model B plots.
y_log  = np.log(y_raw.clip(min=0.01))
linB   = LinearRegression().fit(X_2d, y_log)
predB  = linB.predict(X_2d)
residB = y_log - predB

print("Model A (raw y):")
print(f"  R² = {r2_score(y_raw, predA):.4f} | MSE = {mean_squared_error(y_raw, predA):.4f}")
print(f"  Residual std: {residA.std():.4f}")
print("\nModel B (log(y)):")
print(f"  R² = {r2_score(y_log, predB):.4f} | MSE = {mean_squared_error(y_log, predB):.4f}")
print(f"  Residual std: {residB.std():.4f}")

fig, axes = plt.subplots(2, 2, figsize=(13, 8))

# Raw fit
axes[0, 0].scatter(X_raw, y_raw, alpha=0.4, s=15)
axes[0, 0].plot(X_raw, predA, color='red', linewidth=2)
axes[0, 0].set(title='Model A: Raw y', xlabel='X', ylabel='y')

# Raw residuals: funnel shape shows heteroscedasticity
axes[0, 1].scatter(predA, residA, alpha=0.4, s=15)
axes[0, 1].axhline(0, color='red', linestyle='--')
axes[0, 1].set(title='Model A Residuals\n(Funnel = Heteroscedasticity ❌)', xlabel='Fitted', ylabel='Residual')

# Log-transformed fit
axes[1, 0].scatter(X_raw, y_log, alpha=0.4, s=15)
axes[1, 0].plot(X_raw, predB, color='green', linewidth=2)
axes[1, 0].set(title='Model B: log(y)', xlabel='X', ylabel='log(y)')

# Log residuals: the spread should be more even across fitted values
axes[1, 1].scatter(predB, residB, alpha=0.4, s=15)
axes[1, 1].axhline(0, color='green', linestyle='--')
axes[1, 1].set(title='Model B Residuals\n(More even spread = improved ✅)', xlabel='Fitted', ylabel='Residual')

plt.tight_layout()
plt.show()
print("\n💡 Log-transforming a positive target often stabilizes variance; compare the two residual plots to confirm.")



### 3) Logistic Regression: Intercept, Cross-Entropy & Model Selection 📈

Interpret β₀, then compare logistic regression with a less interpretable model (random forest) on accuracy, recall, AUC-ROC, cross-entropy, and explainability.


In [None]:

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import log_loss, accuracy_score, recall_score, roc_auc_score

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
feat_names = X.columns.tolist()

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

# ── Logistic Regression ──────────────────────────────────────────────────
lr_pipe = Pipeline([('sc', StandardScaler()), ('lr', LogisticRegression(max_iter=2000))])
lr_pipe.fit(X_tr, y_tr)
lr_proba = lr_pipe.predict_proba(X_te)[:, 1]
lr_pred  = lr_pipe.predict(X_te)

lr_model = lr_pipe.named_steps['lr']
beta0    = lr_model.intercept_[0]
p_at_zero = 1 / (1 + np.exp(-beta0))

print("=" * 60)
print("  LOGISTIC REGRESSION: Intercept Interpretation (Q8 review)")
print("=" * 60)
print(f"\n  β₀ (intercept) = {beta0:.4f}")
print("  Interpretation: when ALL scaled features = 0 (each raw feature at its training mean),")
print(f"  the log-odds of the positive class = {beta0:.4f}")
print(f"  → P(y=1 | all features = 0) = sigmoid({beta0:.4f}) = {p_at_zero:.4f}")
print("\n  ⚠️  β₀ is NOT the accuracy. It is the log-odds baseline.")

# ── Random Forest (for comparison) ───────────────────────────────────────
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_tr, y_tr)
rf_proba = rf.predict_proba(X_te)[:, 1]
rf_pred  = rf.predict(X_te)

# ── Side-by-side comparison ──────────────────────────────────────────────
comparison = pd.DataFrame({
    'Metric': ['Accuracy', 'Recall (sensitivity)', 'AUC-ROC', 'Cross-Entropy', 'Explainable?'],
    'Logistic Regression': [
        f"{accuracy_score(y_te, lr_pred):.4f}",
        f"{recall_score(y_te, lr_pred):.4f}",
        f"{roc_auc_score(y_te, lr_proba):.4f}",
        f"{log_loss(y_te, lr_proba):.4f}",
        "✅ Yes: coefficients map directly to log-odds"
    ],
    'Random Forest': [
        f"{accuracy_score(y_te, rf_pred):.4f}",
        f"{recall_score(y_te, rf_pred):.4f}",
        f"{roc_auc_score(y_te, rf_proba):.4f}",
        f"{log_loss(y_te, rf_proba):.4f}",
        "❌ No: requires SHAP/LIME for post-hoc explanation"
    ]
})

print("\n" + "=" * 60)
print("  MODEL COMPARISON (Q14 review)")
print("=" * 60)
print(comparison.to_string(index=False))
print("\n💡 When explainability is mandatory (medical, legal, financial):")
print("   Logistic Regression is preferred even if RF has slightly better accuracy.")
print("   Adjust the decision threshold to balance precision/recall for high-recall needs.")

# Show the 5 most impactful features: largest |coefficient| on the scaled inputs
coef_df = pd.DataFrame({
    'feature': feat_names,
    'coefficient': lr_model.coef_[0]
})
top5_idx = coef_df['coefficient'].abs().sort_values(ascending=False).index[:5]
coef_df = coef_df.loc[top5_idx]
print("\nTop 5 features by |coefficient| (log-odds change per 1 std of the feature):")
print(coef_df.to_string(index=False))



### 4) KNN & Decision Tree: Full Pipeline with Hyperparameter Tuning 🌳


In [None]:

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Use Wine dataset (3 classes, 13 features)
X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=3)
print(f"Wine dataset | Train: {len(X_tr)} | Test: {len(X_te)} | Classes: {np.unique(y)}")

# ── KNN with cross-validation across k ───────────────────────────────────
print("\nKNN cross-validation across k (5-fold):")
ks = range(1, 21)
cv_means, cv_stds = [], []

for k in ks:
    pipe_knn = Pipeline([
        ('sc',  StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=k, metric='minkowski', weights='uniform'))
    ])
    scores = cross_val_score(pipe_knn, X_tr, y_tr, cv=5, scoring='accuracy')
    cv_means.append(scores.mean())
    cv_stds.append(scores.std())

best_k = list(ks)[np.argmax(cv_means)]
print(f"Best k = {best_k} | CV accuracy = {max(cv_means):.4f} ± {cv_stds[np.argmax(cv_means)]:.4f}")

# Final test evaluation
best_knn = Pipeline([('sc', StandardScaler()),
                     ('knn', KNeighborsClassifier(n_neighbors=best_k))])
best_knn.fit(X_tr, y_tr)
print(f"Test accuracy (k={best_k}): {accuracy_score(y_te, best_knn.predict(X_te)):.4f}")

# Plot CV results
cv_means_arr = np.array(cv_means)
cv_stds_arr  = np.array(cv_stds)
plt.figure(figsize=(9, 4))
plt.plot(list(ks), cv_means_arr, marker='o', linewidth=2, label='CV Mean Accuracy')
plt.fill_between(list(ks), cv_means_arr - cv_stds_arr,
                 cv_means_arr + cv_stds_arr, alpha=0.2, label='±1 std')
plt.axvline(best_k, color='red', linestyle='--', label=f'Best k={best_k}')
plt.xlabel('k'); plt.ylabel('CV Accuracy')
plt.title('KNN: Cross-Validated Accuracy vs. k (Wine Dataset)')
plt.legend(); plt.tight_layout(); plt.show()

# ── Decision Tree: depth vs. train/test accuracy ─────────────────────────
print("\nDecision Tree: Depth vs. Accuracy (Q15 review):")
rows = []
for depth in [1, 2, 3, 4, 5, 8, None]:
    dt = DecisionTreeClassifier(max_depth=depth, min_samples_leaf=2, random_state=0)
    dt.fit(X_tr, y_tr)
    rows.append({
        'max_depth': str(depth), 'leaves': dt.get_n_leaves(),
        'train_acc': round(dt.score(X_tr, y_tr), 4),
        'test_acc':  round(dt.score(X_te, y_te),  4)
    })
print(pd.DataFrame(rows).to_string(index=False))
print("\n💡 A large gap between train_acc and test_acc (typical at max_depth=None) signals overfitting.")

# Visualize a shallow tree (max_depth=3)
dt_best = DecisionTreeClassifier(max_depth=3, min_samples_leaf=2, random_state=0)
dt_best.fit(X_tr, y_tr)
feature_names = load_wine().feature_names
plt.figure(figsize=(14, 6))
plot_tree(dt_best, feature_names=feature_names,
          class_names=['class_0', 'class_1', 'class_2'],
          filled=False, rounded=True, fontsize=8)
plt.title('Decision Tree (max_depth=3, min_samples_leaf=2), Wine Dataset')
plt.tight_layout(); plt.show()



## 🔬 Extra Practice: Reinforcing Missed Topics

### Exercise A: Target Leakage Detection Checklist


In [None]:

import pandas as pd

# Feature audit checklist ‚Äî ask these questions before training any model
print("=" * 65)
print("  TARGET LEAKAGE DETECTION CHECKLIST")
print("=" * 65)

checklist = [
    ("Is the feature available BEFORE the prediction time?",
     "days_since_last_payment: only exists after default occurs → ❌ LEAKY"),
    ("Is the feature causally downstream of the label?",
     "collection_calls_made: only happens if default happens → ❌ LEAKY"),
    ("Does the feature have suspiciously high correlation with the label?",
     "correlation > 0.95 between a feature and label is a red flag → ⚠️ INVESTIGATE"),
    ("Would this feature be available at inference time in production?",
     "credit_score, income, loan_amount: all known before lending → ✅ SAFE"),
    ("Was this feature engineered from future data?",
     "30-day rolling average computed using future timestamps → ❌ LEAKY"),
]

for i, (question, example) in enumerate(checklist, 1):
    print(f"\n  {i}. {question}")
    print(f"     Example: {example}")

print("\n💡 Rule of thumb: if a feature would NOT exist at prediction time")
print("   in a production system, it must be excluded from training.")

# Show what extreme correlation looks like
import numpy as np
np.random.seed(42)
n = 500
default_labels = np.random.randint(0, 2, n)
leaky_feature  = default_labels + np.random.normal(0, 0.05, n)
safe_feature   = np.random.normal(50000, 15000, n)

print("\nCorrelation with label:")
print(f"  Leaky feature: {np.corrcoef(leaky_feature, default_labels)[0,1]:.4f} ← suspicious!")
print(f"  Safe  feature: {np.corrcoef(safe_feature,  default_labels)[0,1]:.4f} ← normal")
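
A high correlation is only a symptom. The practical danger is that a leaky feature also inflates hold-out metrics, so the model looks excellent until it is deployed. The sketch below illustrates this on synthetic data built the same way as above. The helper `holdout_accuracy`, the 70/30 split, and the scaling pipeline are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 500
labels = rng.integers(0, 2, n)            # synthetic default labels
leaky = labels + rng.normal(0, 0.05, n)   # derived from the label itself
safe = rng.normal(50_000, 15_000, n)      # unrelated to the label by construction


def holdout_accuracy(features, y):
    """Fit a scaled logistic regression and return accuracy on a held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, y, test_size=0.3, stratify=y, random_state=0
    )
    model = make_pipeline(StandardScaler(), LogisticRegression())
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te)


acc_with_leak = holdout_accuracy(np.column_stack([safe, leaky]), labels)
acc_without_leak = holdout_accuracy(safe.reshape(-1, 1), labels)

print(f"Hold-out accuracy WITH leaky feature:    {acc_with_leak:.3f}")
print(f"Hold-out accuracy WITHOUT leaky feature: {acc_without_leak:.3f}")
```

Because the leak is present in both the training and the test split, a standard hold-out evaluation does not expose it. This is why leakage can be mistaken for a strong model, and why a later failure in production can be mistaken for overfitting.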



### Exercise B: Logistic Regression Intercept (Math Walkthrough)


In [None]:

import numpy as np
from matplotlib import pyplot as plt

# Demonstrate what β₀ controls: the baseline probability P(y=1) when all features = 0
print("=" * 60)
print("  LOGISTIC REGRESSION INTERCEPT: MATHEMATICAL WALKTHROUGH")
print("=" * 60)

beta_0_values = [-3, -1, 0, 1, 3]
beta_1 = 1.5

print("\nWhen X=0, only β₀ matters:")
for b0 in beta_0_values:
    log_odds_at_zero = b0
    p_at_zero = 1 / (1 + np.exp(-b0))
    print(f"  β₀={b0:+.1f} → log-odds={log_odds_at_zero:+.1f} → P(y=1|X=0) = {p_at_zero:.4f}")

print("\n⚠️  β₀ is NOT accuracy: it shifts the sigmoid curve left/right (for β₁ > 0).")
print("   Larger β₀ = higher baseline probability (curve shifted left)")
print("   Smaller β₀ = lower baseline probability (curve shifted right)")

# Visualize how β₀ shifts the decision boundary
X_range = np.linspace(-5, 5, 300)
fig, axes = plt.subplots(1, 2, figsize=(13, 5))

for b0 in beta_0_values:
    prob = 1 / (1 + np.exp(-(b0 + beta_1 * X_range)))
    axes[0].plot(X_range, prob, label=f'β₀={b0:+.0f}')
axes[0].axhline(0.5, color='black', linestyle=':', linewidth=1)
axes[0].set(xlabel='X', ylabel='P(y=1)', title=f'Effect of β₀ on Decision Boundary (β₁={beta_1})')
axes[0].legend(fontsize=8)

# Decision boundary shifts with β₀: P = 0.5 where β₀ + β₁·X = 0, i.e. X = -β₀ / β₁
decision_boundaries = [-b0 / beta_1 for b0 in beta_0_values]
axes[1].barh([f'β₀={b0:+.0f}' for b0 in beta_0_values], decision_boundaries)
axes[1].axvline(0, color='gray', linestyle='--')
axes[1].set(xlabel='Decision boundary (X where P=0.5)', title='Decision Boundary Location')
plt.tight_layout()
plt.show()

print("\nüìå Summary:")
print("  Œ≤‚ÇÄ ‚Üí shifts where the sigmoid crosses 0.5 (the decision boundary)")
print("  Œ≤‚ÇÅ ‚Üí controls the steepness / how fast probability changes with X")
print("  Neither Œ≤‚ÇÄ nor Œ≤‚ÇÅ directly represents accuracy.")
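
The walkthrough above uses hand-picked coefficients. As a complementary check, the sketch below fits scikit-learn's `LogisticRegression` on synthetic one-dimensional data and confirms that the same two identities hold for a fitted model: the sigmoid of `intercept_` equals the predicted probability at X=0, and the probability at X = -β₀/β₁ is 0.5. The data-generating values `true_b0` and `true_b1` are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic 1-D data drawn from known (hypothetical) coefficients.
true_b0, true_b1 = -1.0, 1.5
X = rng.normal(0, 2, size=(2000, 1))
p = 1 / (1 + np.exp(-(true_b0 + true_b1 * X[:, 0])))
y = rng.binomial(1, p)

model = LogisticRegression().fit(X, y)
b0, b1 = model.intercept_[0], model.coef_[0, 0]

# Identity 1: sigmoid(β₀) is the model's probability when X = 0.
p_at_zero_manual = 1 / (1 + np.exp(-b0))
p_at_zero_model = model.predict_proba([[0.0]])[0, 1]

# Identity 2: the probability is exactly 0.5 at X = -β₀ / β₁.
boundary = -b0 / b1
p_at_boundary = model.predict_proba([[boundary]])[0, 1]

print(f"Fitted β₀={b0:.3f}, β₁={b1:.3f}")
print(f"sigmoid(β₀)={p_at_zero_manual:.4f}  vs  predict_proba(X=0)={p_at_zero_model:.4f}")
print(f"Decision boundary X={boundary:.3f}  →  P(y=1)={p_at_boundary:.4f}")
```

The fitted coefficients will not match the generating values exactly, partly because `LogisticRegression` applies L2 regularization by default.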



## 🧰 (Optional) Local Material Checks


In [None]:

import os, zipfile, textwrap

for fname in ['StudyGuide.txt', 'StudyMaterials.zip']:
    print(f'{fname}:', 'FOUND' if os.path.exists(fname) else 'NOT FOUND')

if os.path.exists('StudyGuide.txt'):
    with open('StudyGuide.txt', 'r', encoding='utf-8', errors='ignore') as f:
        content = f.read(700)
    print('\n--- StudyGuide (first 700 chars) ---\n')
    print(textwrap.shorten(content, width=700))

if os.path.exists('StudyMaterials.zip'):
    with zipfile.ZipFile('StudyMaterials.zip', 'r') as z:
        print('\n--- ZIP Contents ---')
        for info in z.infolist()[:30]:
            print(info.filename, info.file_size, 'bytes')



## 🪞 Reflection

**1. Which 1–2 concepts were most challenging, and why?**

The most challenging concept was **target leakage**, specifically recognizing when a feature is *caused by* the label rather than being a cause *of* it. In Q5, I diagnosed the deployment failure as overfitting because the symptoms looked identical: good training performance, poor real-world performance. I missed the root cause: `days_since_last_payment` only exists after a default occurs, so it cannot be observed at prediction time. The second challenging concept was the **logistic regression intercept β₀**: I incorrectly mapped it to accuracy rather than recognizing it as the baseline log-odds when all features equal zero, which is a different quantity.

**2. What trade-offs or assumptions did you overlook in the interview?**

In Q14, I prioritized raw accuracy over explainability when selecting a model for a clinical setting. I overlooked the hard constraint that medical stakeholders legally and ethically require interpretable predictions: a neural network would likely not pass hospital compliance review regardless of its accuracy. I also failed to consider that logistic regression's probability outputs allow clinicians to tune the classification threshold for high recall (catching most at-risk patients), making it the better fit for the stated requirements. The lesson is to identify hard constraints (regulation, explainability, latency) before optimizing for accuracy.
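
The threshold-tuning point can be made concrete with a short sketch. It assumes a synthetic, imbalanced dataset from `make_classification` as a stand-in for clinical data (not real patient data), and the three thresholds are hypothetical values picked for illustration. Lowering the threshold can only add predicted positives, so recall never decreases, usually at the cost of precision.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for a clinical dataset (~15% positives).
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=4,
    weights=[0.85, 0.15], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # P(y=1) for each test sample

results = {}
for threshold in (0.5, 0.3, 0.1):  # hypothetical thresholds
    y_pred = (proba >= threshold).astype(int)
    results[threshold] = (
        recall_score(y_te, y_pred),
        precision_score(y_te, y_pred, zero_division=0),
    )
    recall, precision = results[threshold]
    print(f"threshold={threshold:.1f}  recall={recall:.3f}  precision={precision:.3f}")
```

In a real clinical setting the threshold would be chosen on a validation split, not the test split, and agreed with stakeholders based on the relative cost of missed cases versus false alarms.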

**3. What is your plan to improve over the next week?**

This week I will work through five real-world case studies of target leakage (credit risk, healthcare, e-commerce churn), and for each feature I will ask: *would this be available in production before the prediction is needed?* For logistic regression, I will write out the full math (from linear combination to log-odds to sigmoid to probability) by hand at least three times until β₀ and β₁ feel intuitive. For model selection, I will build a simple decision matrix with axes for accuracy, explainability, inference speed, and regulatory context, and practice applying it to at least two scenario questions per day until the framework is automatic.
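
One possible shape for that decision matrix is sketched below. Everything in it is hypothetical: the `Candidate` class, the `rank` helper, the 0-1 scores, the weights, and the explainability floor are placeholders for illustration, not measured values. The design point is that hard constraints filter candidates first, and only the survivors are ranked by weighted score.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    accuracy: float         # all scores are hypothetical, on a 0-1 scale
    explainability: float
    inference_speed: float
    regulatory_fit: float


def rank(candidates, weights, hard_minimums):
    """Drop candidates that violate a hard minimum, then rank by weighted score."""
    eligible = [
        c for c in candidates
        if all(getattr(c, axis) >= floor for axis, floor in hard_minimums.items())
    ]
    scored = [
        (sum(w * getattr(c, axis) for axis, w in weights.items()), c.name)
        for c in eligible
    ]
    return sorted(scored, reverse=True)


# Placeholder scores for a clinical scenario (not measured results).
candidates = [
    Candidate("logistic_regression", 0.80, 0.90, 0.95, 0.90),
    Candidate("decision_tree", 0.75, 0.85, 0.95, 0.85),
    Candidate("neural_network", 0.90, 0.20, 0.60, 0.30),
]
weights = {"accuracy": 0.4, "explainability": 0.3,
           "inference_speed": 0.1, "regulatory_fit": 0.2}
hard_minimums = {"explainability": 0.7}  # hard constraint applied before scoring

ranking = rank(candidates, weights, hard_minimums)
for score, name in ranking:
    print(f"{name:<22} weighted score = {score:.3f}")
```

With these placeholder numbers the neural network is excluded by the explainability floor despite having the highest accuracy score, which mirrors the Q14 lesson about identifying hard constraints before optimizing.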



## 🚢 Submission (GitHub)
- Ensure the notebook is **executed** and **saved** with your recorded results and completed exercises.
- Add the entire contents of the LLM session, from start to end, in a new notebook called **JobInterviewLLMSession.ipynb**.
- Push to your GitHub repository:

```bash
git add JobInterviewGuide_Workshop.ipynb
git add JobInterviewLLMSession.ipynb
git commit -m "Add graded JobInterviewGuide_Workshop notebook and LLM session"
git push origin main
```
