# 🧪 Lab: Modelling & Model Lifecycle – Predicting Plant Production (GIST Steel Dataset)

---

## 🎯 Learning Outcomes

By completing this lab, you will be able to:

- Prepare and analyse a dataset for modelling.  
- Train and evaluate regression models.  
- Apply cross-validation and hyperparameter tuning using scikit-learn.  
- Track experiments and store models using MLflow and/or Optuna.  
- Reflect on the practical aspects of managing the ML lifecycle.

---



## 🧩 1. Data Setup and Exploration
⏱ *Estimated time: 30–40 min*

### 🧭 Objective  
Understand the dataset structure and the target variable ("plant-level production").

---

### **Task 1.1 – Load and Inspect Data**
- Load the GIST Steel dataset. 
- Display basic info (shape, column names, missing values, and data types).  
- Identify the target variable (production) and key features (capacity, ...).
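
A minimal sketch of the inspection steps above. The file path and schema of the GIST Steel dataset are not given here, so the small in-memory frame below is a stand-in; `plant_id` is a hypothetical column, while `capacity` and `production` are the names used in this task.

```python
import numpy as np
import pandas as pd

# Stand-in data. In the lab you would load the real file instead,
# e.g. with pd.read_csv(...) or pd.read_excel(...) on your local copy.
df = pd.DataFrame(
    {
        "plant_id": ["P1", "P2", "P3", "P4"],         # hypothetical identifier
        "capacity": [1200.0, 850.0, np.nan, 3000.0],  # feature named in the task
        "production": [950.0, 700.0, 400.0, np.nan],  # target named in the task
    }
)

print(df.shape)            # (rows, columns)
print(df.dtypes)           # data type of each column
missing = df.isna().sum()  # missing values per column
print(missing)

# Separate the target from candidate features (identifiers are excluded).
target = "production"
features = [c for c in df.columns if c not in (target, "plant_id")]
print(features)
```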



> 📝 *Markdown prompt:*  
Describe any patterns or potential data quality issues you notice. Which variables might strongly influence production?

---

### **Task 1.2 – Data Cleaning**
- Handle missing values appropriately (e.g., imputation, removal).  
- Check for outliers or incorrect entries in numerical columns.  
- Apply transformations if needed (e.g., log-transform for skewed distributions).
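
One possible way to carry out these cleaning steps, sketched on toy values rather than the real dataset. The choices shown (dropping rows without a target, median imputation, the 1.5 × IQR rule, `log1p`) are illustrative assumptions, not the required solution.

```python
import numpy as np
import pandas as pd

# Toy data with one missing feature value, one missing target, one extreme row.
df = pd.DataFrame(
    {
        "capacity": [100.0, 120.0, np.nan, 110.0, 5000.0],
        "production": [80.0, 95.0, 60.0, np.nan, 4100.0],
    }
)

# Rows without a target cannot be used for supervised training.
clean = df.dropna(subset=["production"]).copy()

# Median imputation is robust to the extreme value in this column.
clean["capacity"] = clean["capacity"].fillna(clean["capacity"].median())

# Flag (rather than delete) values outside 1.5 * IQR of the quartiles.
q1, q3 = clean["capacity"].quantile([0.25, 0.75])
iqr = q3 - q1
clean["capacity_outlier"] = (clean["capacity"] < q1 - 1.5 * iqr) | (
    clean["capacity"] > q3 + 1.5 * iqr
)

# log1p compresses the long right tail and is defined at zero.
clean["log_capacity"] = np.log1p(clean["capacity"])
print(clean)
```

When the data is later split for modelling, imputation statistics should be computed on the training portion only, for example by placing the imputer inside a scikit-learn pipeline.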

> üìù *Markdown prompt:*  
Explain your cleaning choices. Why did you treat the missing or skewed data in that way?

---




### **Task 1.3 – Feature Engineering**
- Create at least two new variables that might improve model performance (e.g., "capacity per worker", "energy efficiency").  
- Encode categorical variables and standardize numeric ones.
- Bonus: you are free to use external socioeconomic or environmental data sources to enhance your feature set.
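
The sketch below illustrates the first two bullets under stated assumptions: `workers`, `energy_use`, and `process` are hypothetical columns, and "energy efficiency" is approximated as capacity per unit of energy so that the target is not used to build a feature.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame(
    {
        "capacity": [1000.0, 2000.0, 1500.0, 3000.0],
        "workers": [100, 250, 150, 200],               # hypothetical column
        "energy_use": [500.0, 900.0, 800.0, 1200.0],   # hypothetical column
        "process": ["BOF", "EAF", "EAF", "BOF"],       # hypothetical categorical
    }
)

# Two engineered ratio features.
df["capacity_per_worker"] = df["capacity"] / df["workers"]
df["capacity_per_energy"] = df["capacity"] / df["energy_use"]

numeric_cols = [
    "capacity",
    "workers",
    "energy_use",
    "capacity_per_worker",
    "capacity_per_energy",
]
categorical_cols = ["process"]

# Standardize numeric columns and one-hot encode categorical ones in one step.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ]
)
X = preprocess.fit_transform(df)
print(X.shape)
```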

> 📝 *Markdown prompt:*  
Document your new feature(s). What business or operational insight do they represent?

---



## 🔍 1.4 Feature Relationships and Correlations
⏱ *Estimated time: 20–25 min*

### 🧭 Objective  
Before training models, it's essential to understand how features relate to each other and to the target variable, both linearly and nonlinearly. This helps identify redundant or uninformative predictors and guides model choice.

---

### **Task 1.4.1 – Correlation Matrix (Linear Relationships)**
- Compute a **correlation matrix** (e.g., using `df.corr()`, `seaborn.heatmap`, `skrub`) to examine pairwise linear relationships among numerical features.  
- Focus on correlations between each feature and the target (`production`), as well as between features themselves.
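
As an illustration, the following sketch computes a Pearson correlation matrix on synthetic data, ranks features by their correlation with `production`, and lists feature pairs that look redundant. The columns `capacity_copy` and `noise` and the 0.9 threshold are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
capacity = rng.uniform(500, 3000, size=200)
df = pd.DataFrame(
    {
        "capacity": capacity,
        # Near-duplicate of capacity, included to show redundancy.
        "capacity_copy": capacity * 1.02 + rng.normal(0, 5, 200),
        "noise": rng.normal(size=200),  # unrelated to the target
        "production": 0.8 * capacity + rng.normal(0, 100, 200),
    }
)

corr = df.corr()  # pairwise Pearson correlations

# Feature-to-target correlations, strongest first (by absolute value).
target_corr = (
    corr["production"].drop("production").sort_values(key=abs, ascending=False)
)
print(target_corr)

# Feature pairs with |r| above the threshold are candidates for removal.
feature_corr = corr.drop(index="production", columns="production")
cols = list(feature_corr.columns)
redundant = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1 :]
    if abs(feature_corr.loc[a, b]) > 0.9
]
print(redundant)
# For a visual check, the matrix can be passed to seaborn.heatmap(corr, annot=True).
```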

> 📝 *Markdown prompt:*  
Which variables show the strongest correlation with production?  
Do any features appear redundant or highly correlated with each other?

---


## 🧮 2. Building Baseline & Linear Models
⏱ *Estimated time: 25–30 min*

### 🧭 Objective  
Establish a simple baseline, then train and interpret a linear model.

---

### **Task 2.1 – Baseline**
- Compute a simple baseline predictor (e.g., mean or median production).  
- Measure RMSE or MAE compared to actual values.
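
A mean-prediction baseline could look like the sketch below. It uses synthetic data and scikit-learn's `DummyRegressor`; the split ratio and random seed are arbitrary choices.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in: one feature (capacity) and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(200, 1))
y = 0.8 * X[:, 0] + rng.normal(0, 100, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The baseline ignores X and always predicts the training mean.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
pred = baseline.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, pred))
mae = mean_absolute_error(y_test, pred)
print(f"Baseline RMSE: {rmse:.1f}, MAE: {mae:.1f}")
```

Switching to `strategy="median"` gives the median baseline, which is often the better reference when MAE is the main metric.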

> 📝 *Markdown prompt:*  
Why is it useful to have a baseline model before trying more complex ones?

---



### **Task 2.2 – Linear Regression**
- Train a multiple linear regression model using the key plant variables.  
- Display coefficients and interpret their meaning.  
- Evaluate the model on training and test data.
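
The following sketch fits a multiple linear regression on synthetic data and prints labelled coefficients. The feature `downtime_hours` and the data-generating equation are invented for illustration, so the signs and sizes say nothing about the real plants.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame(
    {
        "capacity": rng.uniform(500, 3000, 200),
        "downtime_hours": rng.uniform(0, 100, 200),  # hypothetical feature
    }
)
# Assumed relationship: capacity raises production, downtime lowers it.
y = 0.8 * X["capacity"] - 2.0 * X["downtime_hours"] + rng.normal(0, 50, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# Each coefficient is the expected change in production per unit of the
# feature, holding the other features fixed.
coefficients = pd.Series(model.coef_, index=X.columns)
print(coefficients)

r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)
print(f"R² train: {r2_train:.3f}, R² test: {r2_test:.3f}")
```

A large gap between the training and test scores would point to overfitting; similar scores suggest the linear fit generalizes on this data.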

> 📝 *Markdown prompt:*  
Interpret one positive and one negative coefficient. What do they tell you about plant performance drivers?

---


## 🔁 3. Model Evaluation and Selection
⏱ *Estimated time: 45–60 min*

### 🧭 Objective  
Use cross-validation to estimate generalization performance and compare multiple model types.

---

### **Task 3.1 – Cross-Validation**
- Apply **K-Fold cross-validation** (e.g., K=5).  
- Record the average RMSE, MAE, and R² across folds.
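
A sketch of 5-fold cross-validation with several metrics at once, using scikit-learn's `cross_validate` on synthetic data. The error-based scorers are negated by scikit-learn convention, so the sign is flipped before reporting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(200, 2))
y = 0.8 * X[:, 0] - 0.1 * X[:, 1] + rng.normal(0, 100, 200)

scoring = {
    "rmse": "neg_root_mean_squared_error",
    "mae": "neg_mean_absolute_error",
    "r2": "r2",
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(LinearRegression(), X, y, cv=cv, scoring=scoring)

rmse_folds = -scores["test_rmse"]  # flip sign back to a positive error
mae_folds = -scores["test_mae"]
r2_folds = scores["test_r2"]

# The standard deviation across folds gives a rough sense of stability.
print(f"RMSE: {rmse_folds.mean():.1f} ± {rmse_folds.std():.1f}")
print(f"MAE:  {mae_folds.mean():.1f} ± {mae_folds.std():.1f}")
print(f"R²:   {r2_folds.mean():.3f} ± {r2_folds.std():.3f}")
```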

> 📝 *Markdown prompt:*  
Summarize your results. How stable is performance across folds? What might this indicate about model variance?

---


### **Task 3.2 – Model Comparison**
Train and compare at least **three models**:
- Linear Regression  
- Ridge Regression (regularized linear)  
- Random Forest Regressor  

Record cross-validation performance for each model.
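
One way to organise the comparison is to loop over a dictionary of models with a shared set of folds, as sketched below. The synthetic target deliberately includes a nonlinear term; the data and the hyperparameters are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
# Linear effect of feature 0, nonlinear effect of feature 1, feature 2 unused.
y = 1000 * X[:, 0] + 200 * np.sin(6 * X[:, 1]) + rng.normal(0, 30, 300)

models = {
    "LinearRegression": LinearRegression(),
    # Ridge is scale-sensitive, so scaling is done inside the pipeline.
    "Ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# The same folds are reused so that the models are compared on equal terms.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
rows = []
for name, model in models.items():
    rmse = -cross_val_score(
        model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    rows.append({"model": name, "rmse_mean": rmse.mean(), "rmse_std": rmse.std()})

results = pd.DataFrame(rows).set_index("model").sort_values("rmse_mean")
print(results)
```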

> 📝 *Markdown prompt:*  
Create a small results table. Which model performs best? Why might that be the case given the dataset's characteristics?

---

### **Task 3.3 – Hyperparameter Optimization**
- Use **RandomizedSearchCV** or **GridSearchCV** to tune the top model (e.g., Random Forest).  
- Report the best parameters and corresponding validation score.
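
A small `RandomizedSearchCV` sketch for a Random Forest on synthetic data. The search ranges, `n_iter=10`, and `cv=3` are kept small so the example runs quickly; they are not recommended settings.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 1000 * X[:, 0] + 200 * np.sin(6 * X[:, 1]) + rng.normal(0, 30, 200)

# Each parameter is sampled from an integer range on every iteration.
param_distributions = {
    "n_estimators": randint(50, 200),
    "max_depth": randint(2, 12),
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV RMSE: {-search.best_score_:.1f}")
```

`search.best_estimator_` holds the model refitted on all the data passed to `fit`, which can then be evaluated on a held-out test set.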

> 📝 *Markdown prompt:*  
Discuss the role of hyperparameter tuning. How did tuning change your model's performance compared to default settings?

---


## ⚙️ 4. Model Lifecycle: Tracking, Saving, and Loading
⏱ *Estimated time: 30–40 min*

### 🧭 Objective  
Apply tools that support reproducible ML experiments.

---

### **Task 4.1 – Experiment Tracking with MLflow**
- Use MLflow to log parameters (model type, hyperparameters), metrics (RMSE, R²), and artifacts (plots or model files).  
- Run and record at least two model experiments.

> 📝 *Markdown prompt:*  
Describe how MLflow helps manage your experiments. What advantages does it give compared to manual tracking?

---

### **Task 4.2 – Hyperparameter Optimization with Optuna**
- Define an Optuna study to optimize one model (e.g., Ridge or Random Forest).  
- Record the number of trials and best result.

> 📝 *Markdown prompt:*  
Explain what Optuna is doing behind the scenes. How is it different from Grid or Random Search?

---

### **Task 4.3 – Model Storage**
- Save the best performing model to a file (e.g., using joblib or MLflow's model registry).  
- Demonstrate loading the saved model and re-evaluating it on the test set.
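
A sketch of the save-and-reload round trip with joblib. The file name `best_model.joblib` and the temporary directory are hypothetical; a Random Forest on synthetic data stands in for your best model.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 1000 * X[:, 0] + rng.normal(0, 30, 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Save the fitted model to disk.
model_path = Path(tempfile.mkdtemp()) / "best_model.joblib"
joblib.dump(model, model_path)

# Load it back and confirm it reproduces the same test-set error.
loaded = joblib.load(model_path)
rmse_original = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
rmse_loaded = np.sqrt(mean_squared_error(y_test, loaded.predict(X_test)))
print(f"RMSE before saving: {rmse_original:.2f}, after loading: {rmse_loaded:.2f}")
```

joblib files are pickle-based, so they should be loaded with the same library versions used for saving and only from trusted sources.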

> 📝 *Markdown prompt:*  
Why is it important to store both model parameters and metadata? How would you ensure version control of models in a production setting?

---


## 🚀 5. Deployment & Monitoring (Conceptual)
⏱ *Estimated time: 15–20 min*

### 🧭 Objective  
Reflect on how models transition from training to production and stay reliable over time.

---

### **Task 5.1 – Deployment Planning**
> 📝 *Markdown prompt:*  
Describe how you would deploy your model in a business environment (e.g., via REST API, batch pipeline).  
Which metrics would you monitor in production?

---

### **Task 5.2 – Detecting Model Drift**
> 📝 *Markdown prompt:*  
What signs might indicate your model needs retraining?  
Give one example of **data drift** and one of **concept drift** relevant to steel plant production.
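
To make the idea of data drift concrete, the sketch below compares a feature's training distribution with a simulated production sample using a two-sample Kolmogorov-Smirnov test from SciPy. The simulated shift, the helper name `has_drifted`, and the significance level are all assumptions. The test only covers drift in one input feature; concept drift usually has to be detected by tracking prediction error once true values arrive.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_capacity = rng.normal(1500, 300, 1000)  # distribution seen at training time
live_capacity = rng.normal(1800, 300, 1000)   # simulated shift in production data


def has_drifted(reference, current, alpha=0.01):
    """Return True if the KS test rejects 'same distribution' at level alpha."""
    result = ks_2samp(reference, current)
    return bool(result.pvalue < alpha)


print("Drift detected:", has_drifted(train_capacity, live_capacity))
```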

---

## 💬 6. Reflection
⏱ *Estimated time: 10–15 min*

> 📝 *Markdown prompt:*  
1. Which step of the modelling lifecycle did you find most challenging and why?  
2. What would you do differently if you had access to additional plant-level data?  
3. How would you communicate model insights to a business audience?

---

✅ **End of Lab**

Next week: Short quiz on theoretical concepts (distributions, regression, model selection, and experiment tracking).

