

---

### **1. Question:**  
*Explain how you would handle missing data in a dataset with over 1 million records. What are the trade-offs of different imputation methods?*  

**Answer:**  
- **Deletion:** Drop rows or columns with missing values. Simple, but it discards data and can bias results unless values are missing completely at random.  
- **Mean/Median Imputation:** Fast, but it shrinks variance and weakens relationships between features.  
- **Model-Based Imputation (e.g., KNN, MICE):** Usually more accurate, but computationally expensive at this scale.  
- **Domain-Specific Imputation:** Use business logic (e.g., filling zeros for missing sales when a missing record means no sale).  
**Trade-off:** Accuracy vs. computational cost. For large datasets, stochastic regression or approximate KNN may balance both.  
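
The variance distortion from median imputation can be seen in a small simulation. This is a minimal sketch using only the standard library; the data, the 30% missing rate, and the helper name `impute_median` are assumptions made up for illustration.

```python
import random
import statistics

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

rng = random.Random(0)
complete = [rng.gauss(50, 10) for _ in range(10_000)]
# Hypothetical missingness: knock out roughly 30% of values completely at random.
with_gaps = [None if rng.random() < 0.3 else v for v in complete]
imputed = impute_median(with_gaps)

std_observed = statistics.stdev([v for v in with_gaps if v is not None])
std_imputed = statistics.stdev(imputed)
print(f"std of observed values: {std_observed:.2f}")
print(f"std after median fill:  {std_imputed:.2f}")
```

The filled column has a visibly smaller standard deviation because every imputed value sits exactly at the center of the distribution.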

---

### **2. Question:**  
*How would you design an A/B test to evaluate a new recommendation algorithm? What metrics would you track, and how would you determine sample size?*  

**Answer:**  
- **Hypothesis:** Null (no difference) vs. alternative (new algorithm improves metrics).  
- **Metrics:** Click-through rate (CTR), conversion rate, revenue per user.  
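- **Sample Size:** Use power analysis to get the required number of users per group:  
  ```python
  from statsmodels.stats.power import TTestIndPower
  effect_size = 0.2  # Standardized effect size (Cohen's d), not the raw metric lift
  power = 0.8        # 80% power
  alpha = 0.05       # Significance level
  analysis = TTestIndPower()
  # Returns the size of ONE group; the second group defaults to the same size (ratio=1)
  sample_size = analysis.solve_power(effect_size=effect_size, power=power, alpha=alpha)
  ```  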
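- **Randomization:** Assign users randomly and keep each user in the same group for the whole test. Random assignment balances confounders on average; stratify or check covariate balance (e.g., user demographics) to confirm it.  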
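
For a rate metric such as CTR, the sample size can also be estimated directly from the two proportions. The sketch below uses the standard normal-approximation formula for a two-sided test; the baseline of 10% and target of 12% are hypothetical numbers, and the function name is an assumption.

```python
import math
from statistics import NormalDist

def sample_size_two_proportions(p_control, p_treatment, alpha=0.05, power=0.8):
    """Users needed per group for a two-sided z-test on two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    n = (z_alpha + z_power) ** 2 * variance / (p_control - p_treatment) ** 2
    return math.ceil(n)

# Hypothetical example: detect a CTR lift from 10% to 12%.
n_per_group = sample_size_two_proportions(0.10, 0.12)
print(n_per_group)
```

Halving the detectable lift roughly quadruples the required sample, which is why the minimum detectable effect should be agreed on before the test starts.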

---

### **3. Question:**  
*Explain the bias-variance trade-off in the context of a random forest model. How does increasing the number of trees affect this trade-off?*  

**Answer:**  
- **Bias:** Error due to overly simplistic assumptions (high bias = underfitting).  
- **Variance:** Error due to sensitivity to small fluctuations (high variance = overfitting).  
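- **Random Forest:** Each deep tree has low bias but high variance. Averaging many trees, decorrelated by bootstrap sampling and random feature selection, reduces variance while leaving bias essentially unchanged.  
- **More Trees:** Decreases variance further (smoother predictions) with diminishing returns, because the trees remain correlated. Bias stays roughly the same, and extra trees mainly cost compute rather than causing overfitting.  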
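
The diminishing returns can be made concrete with the standard result for the variance of an average of equally correlated estimators: ρσ² + (1 − ρ)σ² / B for B trees. The sketch below assumes a per-tree variance of 1.0 and a pairwise correlation of 0.3; both numbers and the function name are illustrative, not taken from any real model.

```python
def ensemble_variance(tree_variance, correlation, n_trees):
    """Variance of the mean of n_trees equally correlated predictions."""
    return correlation * tree_variance + (1 - correlation) * tree_variance / n_trees

for n_trees in (1, 10, 100, 1000):
    print(n_trees, round(ensemble_variance(1.0, 0.3, n_trees), 4))
```

The variance approaches the floor ρσ² but never drops below it, so lowering the correlation between trees matters more than adding trees once the forest is large.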

---

### **4. Question:**  
*How would you optimize a Python script that processes a 10GB CSV file on a machine with 4GB RAM?*  

**Answer:**  
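- **Chunking:** Use `pandas.read_csv(path, chunksize=10_000)` to process the file in batches and aggregate results incrementally.  
- **Dask/Modin:** Parallel DataFrame libraries; Dask can also work on partitions that do not fit in memory at once.  
- **Data Types:** Reduce memory by downcasting `float64` to `float32` and converting low-cardinality strings to `category`.  
- **Filter Early:** Load only the columns you need (`usecols`) and drop unneeded rows as soon as each chunk is read.  
- **Disk-Based Tools:** Use SQLite or DuckDB for out-of-core operations.  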
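
The streaming idea behind chunking can be sketched without any third-party library. The example below uses the standard-library `csv` module rather than pandas, and the column names and file name are hypothetical; only the running totals are held in memory, never the whole file.

```python
import csv
import io
from collections import defaultdict

def total_by_key(lines, key_column, value_column):
    """Stream rows one at a time and keep only running totals in memory."""
    totals = defaultdict(float)
    for row in csv.DictReader(lines):
        totals[row[key_column]] += float(row[value_column])
    return dict(totals)

# Small in-memory stand-in for a large file; with a real file, pass
# open("sales.csv", newline="") instead (the file name is hypothetical).
sample = io.StringIO("product,amount\nA,10.5\nB,3.0\nA,4.5\n")
result = total_by_key(sample, "product", "amount")
print(result)
```

The same pattern applies to pandas chunks: reduce each chunk to a small summary, then combine the summaries.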

---

### **5. Question:**  
*Derive the gradient for logistic regression with L2 regularization. Show the impact of the regularization term.*  

**Answer:**  
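Logistic loss with L2, where h_θ(x) = σ(θᵀx) and the bias term θ₀ is conventionally left out of the penalty:  
\[
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log\left(1-h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2
\]  
Because σ'(z) = σ(z)(1 − σ(z)), the derivatives of the two log terms combine into a single error term. Gradient for j ≥ 1:  
\[
\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} + \frac{\lambda}{m} \theta_j
\]  
**Impact of λ:** The extra (λ/m)θ_j term shrinks weights toward zero on every update, which reduces overfitting but can cause underfitting if λ is too large.  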
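
One way to sanity-check a derived gradient is to compare it with finite differences. The sketch below is a plain-Python check on made-up toy data; it assumes index 0 of `theta` is a bias that is not regularized, and all function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, y, lam):
    """Regularized logistic loss; theta[0] is treated as an unregularized bias."""
    m = len(y)
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        total -= yi * math.log(h) + (1 - yi) * math.log(1 - h)
    penalty = lam / (2 * m) * sum(t * t for t in theta[1:])
    return total / m + penalty

def gradient(theta, X, y, lam):
    """Analytic gradient: data term plus (lam / m) * theta_j for j >= 1."""
    m = len(y)
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        err = sigmoid(sum(t * v for t, v in zip(theta, xi))) - yi
        for j, v in enumerate(xi):
            grad[j] += err * v / m
    for j in range(1, len(theta)):
        grad[j] += lam / m * theta[j]
    return grad

def numeric_gradient(theta, X, y, lam, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grad = []
    for j in range(len(theta)):
        up = theta[:j] + [theta[j] + eps] + theta[j + 1:]
        down = theta[:j] + [theta[j] - eps] + theta[j + 1:]
        grad.append((cost(up, X, y, lam) - cost(down, X, y, lam)) / (2 * eps))
    return grad

# Toy data (made up); the first feature is the constant 1 for the bias.
X = [[1.0, 0.5, 1.2], [1.0, -1.0, 0.3], [1.0, 2.0, -0.7], [1.0, 0.1, 0.9]]
y = [1, 0, 1, 0]
theta = [0.1, -0.2, 0.3]
analytic = gradient(theta, X, y, lam=1.0)
numeric = numeric_gradient(theta, X, y, lam=1.0)
print(max(abs(a - n) for a, n in zip(analytic, numeric)))
```

The printed maximum difference should be close to zero. Comparing `gradient(..., lam=1.0)` with `gradient(..., lam=0.0)` isolates the regularization contribution, which is exactly `lam / m * theta[j]` for each non-bias weight.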

---

### **6. Question:**  
*How would you explain a p-value to a non-technical stakeholder?*  

**Answer:**  
"A p-value measures how surprising the data would be if there were no real effect. A small p-value (e.g., < 0.05) means that a result at least as extreme as the one we observed (e.g., higher sales after a change) would rarely happen by random chance alone, so we call it statistically significant. It does not tell us the probability that the effect is real, or how large or important the effect is."  

---

### **7. Question:**  
*Write SQL to find the second-highest salary in a table `employees` with columns `id`, `name`, `salary`.*  

**Answer:**  
```sql
SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 1;
```  
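*Alternative (returns `NULL` instead of an empty result when there is no second-highest salary; both queries treat tied salaries as one value):*  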
```sql
SELECT MAX(salary)
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
```  
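
The difference between the two queries shows up on edge cases. The following sketch runs both against an in-memory SQLite database using Python's `sqlite3` module; the sample salaries and the helper name `run_both` are made up for illustration, and other database engines may differ in `LIMIT`/`OFFSET` syntax.

```python
import sqlite3

OFFSET_QUERY = """
    SELECT DISTINCT salary FROM employees
    ORDER BY salary DESC LIMIT 1 OFFSET 1
"""
SUBQUERY = """
    SELECT MAX(salary) FROM employees
    WHERE salary < (SELECT MAX(salary) FROM employees)
"""

def run_both(salaries):
    """Load the salaries into an in-memory table and run both queries."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER)")
    conn.executemany(
        "INSERT INTO employees (name, salary) VALUES (?, ?)",
        [(f"emp{i}", s) for i, s in enumerate(salaries)],
    )
    results = (conn.execute(OFFSET_QUERY).fetchall(), conn.execute(SUBQUERY).fetchall())
    conn.close()
    return results

tied_top = run_both([100, 200, 200])   # two employees share the top salary
single = run_both([200, 200])          # no second-highest salary exists
print(tied_top)
print(single)
```

With a tied top salary both queries return 100. When only one distinct salary exists, the `OFFSET` query returns no rows while the subquery version returns a single `NULL` (shown as `None` in Python).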

---

### **8. Question:**  
*How would you deploy a machine learning model as a REST API? Outline steps and tools.*  

**Answer:**  
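1. **Serialize Model:** Save with `pickle` (`.pkl`) or `joblib` (`.joblib`).  
2. **Framework:** Use FastAPI or Flask (Django also works, but is heavier than a single prediction endpoint needs).  
3. **API Endpoint:**  
   ```python
   import pickle

   from fastapi import FastAPI

   app = FastAPI()

   # Load the model once at startup; the context manager closes the file handle.
   with open("model.pkl", "rb") as f:
       model = pickle.load(f)

   @app.post("/predict")
   def predict(data: dict):
       prediction = model.predict([data["features"]])[0]
       # NumPy scalars are not JSON-serializable; .item() assumes a NumPy-style return value.
       return {"prediction": prediction.item() if hasattr(prediction, "item") else prediction}
   ```  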
4. **Containerize:** Docker + Kubernetes for scaling.  
5. **Monitor:** Log predictions, track latency/errors (Prometheus, Grafana).  

---

### **9. Question:**  
*Explain how a ROC curve works and how AUC is interpreted.*  

**Answer:**  
- **ROC Curve:** Plots True Positive Rate (TPR) vs. False Positive Rate (FPR) at various thresholds.  
- **AUC (Area Under Curve):** Measures model’s ability to distinguish classes. AUC = 0.5 (random), AUC = 1.0 (perfect).  
- **Interpretation:** Higher AUC = better ranking of positives vs. negatives (not necessarily better calibration).  
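
The ranking interpretation can be computed directly: AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below is a brute-force illustration on four made-up predictions, not an efficient implementation; the function name is an assumption.

```python
def auc_by_pairs(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_by_pairs(labels, scores))  # 3 of 4 pairs ranked correctly -> 0.75
```

Only the order of the scores matters here, which is why a model can have a high AUC and still be poorly calibrated.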

---

### **10. Question:**  
*How would you detect and handle multicollinearity in a linear regression model?*  

**Answer:**  
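- **Detection:** Calculate the Variance Inflation Factor (VIF) for each predictor:  
  ```python
  from statsmodels.stats.outliers_influence import variance_inflation_factor
  from statsmodels.tools.tools import add_constant
  X_const = add_constant(X)  # X is a pandas DataFrame of predictors; VIF needs an intercept column
  vif = {col: variance_inflation_factor(X_const.values, i) for i, col in enumerate(X_const.columns) if col != "const"}
  ```  
  A VIF above 5–10 is a common rule of thumb for high multicollinearity.  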
- **Handling:** Drop one correlated variable, use PCA, or apply regularization (Ridge/Lasso).  

---

### **11. Question:**  
*Explain the difference between batch gradient descent, stochastic gradient descent (SGD), and mini-batch SGD. When would you use each?*  

**Answer:**  
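- **Batch GD:** Computes the gradient over the entire dataset for each update. Stable and exact, but slow per step (use for small datasets that fit in memory).  
- **SGD:** Updates after every sample. Noisy but cheap per step (use for large datasets or online learning).  
- **Mini-Batch SGD:** Updates on small batches, balancing gradient noise against hardware efficiency (common in deep learning; batch size 32–512).  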
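
All three variants are the same loop with a different batch size. The sketch below fits a one-feature linear model in plain Python on noiseless toy data; the learning rate, epoch count, and function name are assumptions chosen so that every variant converges on this particular data.

```python
import random

def fit_line(xs, ys, batch_size, lr=0.2, epochs=500, seed=0):
    """Fit y = w * x + b by gradient descent on half the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            grad_w = sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = sum(errors) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

xs = [i / 10 for i in range(20)]
ys = [2 * x + 1 for x in xs]  # noiseless toy data with true w = 2, b = 1

batch_fit = fit_line(xs, ys, batch_size=len(xs))  # batch GD: 1 update per epoch
sgd_fit = fit_line(xs, ys, batch_size=1)          # SGD: 20 updates per epoch
mini_fit = fit_line(xs, ys, batch_size=5)         # mini-batch: 4 updates per epoch
print(batch_fit, sgd_fit, mini_fit)
```

On real, noisy data the SGD and mini-batch runs would hover around the optimum rather than settle exactly on it, which is why learning-rate decay is often paired with small batches.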

---

### **12. Question:**  
*How would you implement a custom loss function for an imbalanced classification problem in TensorFlow?*  

**Answer:**  
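```python
import tensorflow as tf

def focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0):
    """Binary focal loss; y_pred holds predicted probabilities of the positive class."""
    y_true = tf.cast(y_true, y_pred.dtype)
    y_pred = tf.clip_by_value(y_pred, 1e-7, 1 - 1e-7)
    p_t = y_true * y_pred + (1 - y_true) * (1 - y_pred)    # Probability assigned to the true class
    alpha_t = y_true * alpha + (1 - y_true) * (1 - alpha)  # Class-balancing weight
    return tf.reduce_mean(-alpha_t * (1 - p_t) ** gamma * tf.math.log(p_t))  # Down-weight easy examples
```  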
*Use Case:* For class imbalance, focal loss reduces the impact of well-classified majority class examples.  
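
The down-weighting is easiest to see on individual examples. The sketch below is plain Python rather than TensorFlow and, as a simplification, applies a single weight `alpha` to every example; the three probabilities are made-up values for an easy, a moderate, and a hard example.

```python
import math

def cross_entropy(p_true):
    """Standard cross-entropy for one example, given the probability of its true class."""
    return -math.log(p_true)

def focal_term(p_true, alpha=0.25, gamma=2.0):
    """Focal loss for one example: cross-entropy scaled by alpha * (1 - p_true) ** gamma."""
    return alpha * (1 - p_true) ** gamma * cross_entropy(p_true)

for p_true in (0.95, 0.6, 0.1):  # easy, moderate, and hard examples
    ce, fl = cross_entropy(p_true), focal_term(p_true)
    print(f"p_true={p_true:.2f}  cross-entropy={ce:.4f}  focal={fl:.6f}  ratio={fl / ce:.4f}")
```

The ratio column is the modulating factor alone: it is far smaller for the confidently correct example than for the hard one, so hard examples dominate the gradient.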

---

### **13. Question:**  
*What is the curse of dimensionality? How does it affect KNN, and how can you mitigate it?*  

**Answer:**  
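- **Curse:** As the number of dimensions grows, a fixed amount of data becomes increasingly sparse, and distance metrics lose contrast.  
- **KNN Impact:** The distances to the nearest and farthest points become nearly equal, so the "nearest" neighbors are barely more similar than any other point.  
- **Mitigation:** Feature selection, PCA, or manifold learning (e.g., UMAP; t-SNE is mainly a visualization tool and, in its standard form, cannot embed new points).  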
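
The loss of distance contrast can be observed with a quick simulation. This sketch draws uniform random points in a unit hypercube and compares the nearest and farthest distance from a random query; the point count, seed, and function name are arbitrary assumptions.

```python
import math
import random

def distance_contrast(n_dims, n_points=200, seed=0):
    """Ratio of nearest to farthest distance from a random query to uniform random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return min(distances) / max(distances)

for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(distance_contrast(n_dims), 3))
```

A ratio near 0 means the nearest neighbor is much closer than the farthest point; as the ratio climbs toward 1 in high dimensions, "nearest" carries less and less information.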

---

### **14. Question:**  
*Write a PySpark query to calculate the 7-day rolling average of sales per product.*  

**Answer:**  
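```python
from pyspark.sql.window import Window
from pyspark.sql.functions import avg

# Current row plus the 6 preceding rows per product.
# This equals 7 calendar days only if `df` has exactly one row per product per day.
window = Window.partitionBy("product_id").orderBy("date").rowsBetween(-6, Window.currentRow)
df = df.withColumn("rolling_avg", avg("sales").over(window))
```  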

---

### **15. Question:**  
*How would you troubleshoot a model that performs well on training data but poorly on validation data?*  

**Answer:**  
- **Overfitting:** Add regularization (L2/dropout), reduce model complexity, or get more data.  
- **Data Leakage:** Check that the split is clean (e.g., time-ordered splits for time series, no duplicated rows across sets) and that preprocessing is fit on training data only.  
- **Validation Set Representativeness:** Check whether the validation data follows the same distribution as the training data and as real-world data.  
- **Hyperparameter Tuning:** Use cross-validation to choose parameters instead of tuning against the training score.  

---





### **How to Stand Out in Data Science Internships**  
*(1-Page Cheat Sheet)*  

#### **1. Master Core Skills**  
- **Math/Stats:** Probability, linear algebra, hypothesis testing.  
- **Algorithms:** Implement ML models from scratch (e.g., gradient descent).  
- **Tools:** Python (NumPy, Pandas), SQL (window functions), Git.  

#### **2. Build a Portfolio**  
- **Projects > Certs:** End-to-end projects (cleaning → deployment).  
- **GitHub:** Documented code + live demos (Streamlit/FastAPI).  
- **Specialize:** Pick a niche (e.g., NLP, MLOps).  

#### **3. Crush Interviews**  
- **SQL:** CTEs, joins, optimization.  
- **Python:** Memory efficiency, OOP.  
- **Case Studies:** Use CRISP-DM framework.  

#### **4. Learn "Unsexy" Tech**  
- **Data Engineering:** Spark, Airflow, Delta Lake.  
- **MLOps:** Model monitoring, TF Serving.  
- **Cloud:** AWS/GCP certifications and hands-on work with managed ML services (e.g., SageMaker).  

#### **5. Network Strategically**  
- **LinkedIn:** Post project breakdowns.  
- **Meetups:** Attend PyData/NeurIPS.  
- **Cold Outreach:** Message hiring managers with specific value propositions.  

#### **6. Stay Trend-Savvy**  
- **Papers:** Follow arXiv/Papers With Code.  
- **Compete:** Kaggle (even if you don’t win).  

#### **7. Optimize Your Resume**  
- **Quantify Impact:** “Improved X by Y% using Z.”  
- **Tailor:** Highlight relevant skills per role.  

#### **8. Business Acumen**  
- **Ask “Why?”:** Tie work to revenue (e.g., “Saved $2M/year”).  
- **Domain Knowledge:** Finance? Learn risk models. Healthcare? Study HIPAA.  

### **Key Mindset**  
- **Be T-shaped:** Deep in one area, broad in others.  
- **Show, don’t tell:** Deploy projects, write blogs.  
- **Hustle smartly:** Network with intent, follow up.  

---

