## Lab Assignment Three: Extending Logistic Regression 
GROUP MEMBERS:
- **Alex Chen** 
- **Paige Maple** 
- **Sam Valentine**

### Sources
1. https://github.com/eclarson/MachineLearningNotebooks/blob/master/05.%20Logistic%20Regression.ipynb
2. ChatGPT (For formatting text and plots)

### Preparation and Overview (3 pts)

#### Part One (2 pts)
Explain the task and what business-case or use-case it is designed to solve (or designed to investigate). Detail exactly what the classification task is and what parties would be interested in the results. For example, would the model be deployed or used mostly for offline analysis? As in previous labs, also detail how good the classifier needs to perform in order to be useful. 

1. **Overview of the Dataset and Its Purpose**  
   The **Steel Energy Demand Dataset** ([Kaggle link](https://www.kaggle.com/competitions/steel-industry-energy-consumption)) contains operational data from a steel production plant, with the objective of predicting **energy load demands** over 15-minute intervals. Each record is labeled with one of three possible **Load Types**: **light**, **medium**, or **maximum**. These categories reflect the level of energy usage placed on the plant’s systems at a given date and time.  
   The purpose of the dataset is twofold: (1) to enable the development of a **classification model** that can automatically forecast energy demand categories in real time, and (2) to help both the **steel plant** and the **electricity provider** operate more efficiently. For the steel plant, predictions allow better allocation of resources, ensuring sufficient supply during maximum load while avoiding waste during light load. For the electricity provider (e.g., Korea Electric Power Corporation), reliable demand forecasts help stabilize distribution and reduce operational strain.  

2. **Prediction Task**  
   For our project, the **prediction task** is to build a **multiclass classification model** that can accurately predict the **Load Type** for upcoming 15-minute intervals. Unlike regression tasks that estimate continuous consumption values, this task requires choosing one of three discrete categories: light, medium, or maximum. The baseline strategy of predicting the most common class (“light load”) for every case achieves about **52% accuracy**, because roughly 51.6% of the records are light load. Our objective is to **outperform this baseline**, demonstrating the value of machine learning for predictive accuracy and operational planning.  

3. **Why This Matters and Performance Expectations**  
   The results of this classification task matter for **multiple stakeholders**. For the **steel company**, improved forecasting reduces waste, lowers costs, and enhances production efficiency. For the **electricity provider**, demand prediction contributes to grid stability and energy savings. In practice, the model would need to be **deployed in real time**, classifying energy loads continuously to inform immediate decision-making.  
   To be considered useful, the classifier must perform **meaningfully better than 52% accuracy**, since that baseline can already be achieved without machine learning. Even moderate improvements beyond this threshold can translate into cost reductions and sustainability gains. Logistic regression provides a reasonable starting point, and more advanced models may yield stronger performance.  
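
The baseline figure can be reproduced directly from the class proportions. The sketch below is illustrative only: it hard-codes the proportions printed by the stratified split later in this report, and the variable names are our own.

```python
# Class proportions as printed for the training split later in this report
# (keys are the integer codes of Load_Type; code 0 is Light_Load).
class_proportions = {0: 0.515732, 1: 0.207549, 2: 0.276719}

# A majority-class predictor always outputs the most frequent class,
# so its accuracy equals that class's share of the data.
majority_class = max(class_proportions, key=class_proportions.get)
baseline_accuracy = class_proportions[majority_class]

print(f"majority class code: {majority_class}")
print(f"baseline accuracy:   {baseline_accuracy:.1%}")  # 51.6%
```

Any trained model should be compared against this number rather than against chance (33%).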

#### Part Two (0.5 pt)
 (mostly the same processes as from previous labs) Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis (give reasoning). Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created). Provide a breakdown of the variables after preprocessing (such as the mean, std, etc. for all variables, including numeric and categorical). 

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Show the dataset before preprocessing
df = pd.read_csv("../dataset/Steel_industry_data.csv")
print("Original dataset:")
df.info()
display(df.head())
display(df.describe())

# Data preprocessing
# Encode load types as integers. LabelEncoder sorts the labels alphabetically,
# so 0 = Light_Load, 1 = Maximum_Load, 2 = Medium_Load.
encoder = LabelEncoder()
df['Load_Type'] = encoder.fit_transform(df['Load_Type'])
# Extract hour and month from the date, then drop the date column
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df['hour'] = df['date'].dt.hour
df['month'] = df['date'].dt.month
df = df.drop(columns=['date'])
# Drop redundant columns
df = df.drop(columns=['CO2(tCO2)', 'NSM', 'Day_of_week'])
# Convert WeekStatus to binary values (0 = Weekday, 1 = Weekend)
df['WeekStatus'] = df['WeekStatus'].map({'Weekday': 0, 'Weekend': 1})
# Min-max normalization to the range [0, 1]
norm_cols = [
    'Usage_kWh',
    'Lagging_Current_Reactive.Power_kVarh',
    'Leading_Current_Reactive_Power_kVarh',
    'Lagging_Current_Power_Factor',
    'Leading_Current_Power_Factor',
    'hour', 'month'
]
scaler = MinMaxScaler()
df[norm_cols] = scaler.fit_transform(df[norm_cols])

print("\nProcessed dataset:")
df.info()
display(df.head())
display(df.describe())
```


Original dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35040 entries, 0 to 35039
Data columns (total 11 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   date                                  35040 non-null  object 
 1   Usage_kWh                             35040 non-null  float64
 2   Lagging_Current_Reactive.Power_kVarh  35040 non-null  float64
 3   Leading_Current_Reactive_Power_kVarh  35040 non-null  float64
 4   CO2(tCO2)                             35040 non-null  float64
 5   Lagging_Current_Power_Factor          35040 non-null  float64
 6   Leading_Current_Power_Factor          35040 non-null  float64
 7   NSM                                   35040 non-null  int64  
 8   WeekStatus                            35040 non-null  object 
 9   Day_of_week                           35040 non-null  object 
 10  Load_Type                             35040 non-null  object 
dtypes: float64(6), int64(1), object(4)

| | date | Usage_kWh | Lagging_Current_Reactive.Power_kVarh | Leading_Current_Reactive_Power_kVarh | CO2(tCO2) | Lagging_Current_Power_Factor | Leading_Current_Power_Factor | NSM | WeekStatus | Day_of_week | Load_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01/01/2018 00:15 | 3.17 | 2.95 | 0.0 | 0.0 | 73.21 | 100.0 | 900 | Weekday | Monday | Light_Load |
| 1 | 01/01/2018 00:30 | 4.0 | 4.46 | 0.0 | 0.0 | 66.77 | 100.0 | 1800 | Weekday | Monday | Light_Load |
| 2 | 01/01/2018 00:45 | 3.24 | 3.28 | 0.0 | 0.0 | 70.28 | 100.0 | 2700 | Weekday | Monday | Light_Load |
| 3 | 01/01/2018 01:00 | 3.31 | 3.56 | 0.0 | 0.0 | 68.09 | 100.0 | 3600 | Weekday | Monday | Light_Load |
| 4 | 01/01/2018 01:15 | 3.82 | 4.5 | 0.0 | 0.0 | 64.72 | 100.0 | 4500 | Weekday | Monday | Light_Load |


| | Usage_kWh | Lagging_Current_Reactive.Power_kVarh | Leading_Current_Reactive_Power_kVarh | CO2(tCO2) | Lagging_Current_Power_Factor | Leading_Current_Power_Factor | NSM |
|---|---|---|---|---|---|---|---|
| count | 35040.0 | 35040.0 | 35040.0 | 35040.0 | 35040.0 | 35040.0 | 35040.0 |
| mean | 27.386892 | 13.035384 | 3.870949 | 0.011524 | 80.578056 | 84.36787 | 42750.0 |
| std | 33.44438 | 16.306 | 7.424463 | 0.016151 | 18.921322 | 30.456535 | 24940.534317 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 3.2 | 2.3 | 0.0 | 0.0 | 63.32 | 99.7 | 21375.0 |
| 50% | 4.57 | 5.0 | 0.0 | 0.0 | 87.96 | 100.0 | 42750.0 |
| 75% | 51.2375 | 22.64 | 2.09 | 0.02 | 99.0225 | 100.0 | 64125.0 |
| max | 157.18 | 96.91 | 27.76 | 0.07 | 100.0 | 100.0 | 85500.0 |



Processed dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35040 entries, 0 to 35039
Data columns (total 9 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Usage_kWh                             35040 non-null  float64
 1   Lagging_Current_Reactive.Power_kVarh  35040 non-null  float64
 2   Leading_Current_Reactive_Power_kVarh  35040 non-null  float64
 3   Lagging_Current_Power_Factor          35040 non-null  float64
 4   Leading_Current_Power_Factor          35040 non-null  float64
 5   WeekStatus                            35040 non-null  int64  
 6   Load_Type                             35040 non-null  int32  
 7   hour                                  35040 non-null  float64
 8   month                                 35040 non-null  float64
dtypes: float64(7), int32(1), int64(1)
memory usage: 2.3 MB


| | Usage_kWh | Lagging_Current_Reactive.Power_kVarh | Leading_Current_Reactive_Power_kVarh | Lagging_Current_Power_Factor | Leading_Current_Power_Factor | WeekStatus | Load_Type | hour | month |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.020168 | 0.030441 | 0.0 | 0.7321 | 1.0 | 0 | 0 | 0.0 | 0.0 |
| 1 | 0.025449 | 0.046022 | 0.0 | 0.6677 | 1.0 | 0 | 0 | 0.0 | 0.0 |
| 2 | 0.020613 | 0.033846 | 0.0 | 0.7028 | 1.0 | 0 | 0 | 0.0 | 0.0 |
| 3 | 0.021059 | 0.036735 | 0.0 | 0.6809 | 1.0 | 0 | 0 | 0.043478 | 0.0 |
| 4 | 0.024303 | 0.046435 | 0.0 | 0.6472 | 1.0 | 0 | 0 | 0.043478 | 0.0 |


| | Usage_kWh | Lagging_Current_Reactive.Power_kVarh | Leading_Current_Reactive_Power_kVarh | Lagging_Current_Power_Factor | Leading_Current_Power_Factor | WeekStatus | Load_Type | hour | month |
|---|---|---|---|---|---|---|---|---|---|
| count | 35040.0 | 35040.0 | 35040.0 | 35040.0 | 35040.0 | 35040.0 | 35040.0 | 35040.0 | 35040.0 |
| mean | 0.174239 | 0.13451 | 0.139443 | 0.805781 | 0.843679 | 0.284932 | 0.760959 | 0.5 | 0.502366 |
| std | 0.212778 | 0.168259 | 0.267452 | 0.189213 | 0.304565 | 0.451388 | 0.857523 | 0.300969 | 0.313446 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.020359 | 0.023733 | 0.0 | 0.6332 | 0.997 | 0.0 | 0.0 | 0.25 | 0.272727 |
| 50% | 0.029075 | 0.051594 | 0.0 | 0.8796 | 1.0 | 0.0 | 0.0 | 0.5 | 0.545455 |
| 75% | 0.32598 | 0.233619 | 0.075288 | 0.990225 | 1.0 | 1.0 | 2.0 | 0.75 | 0.818182 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 |


The **class variable** in this dataset is `Load_Type`, which has three classes. `LabelEncoder` assigns integer codes in alphabetical order of the label strings, so the encoding is:  
- **0 → Light_Load**  
- **1 → Maximum_Load**  
- **2 → Medium_Load**  
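
Because `LabelEncoder` assigns integer codes in sorted order of the label strings rather than in order of load intensity, the code-to-class mapping is worth checking directly. A minimal check, using only the three label strings from the raw `Load_Type` column (the `mapping` name is our own):

```python
from sklearn.preprocessing import LabelEncoder

# The three label strings that appear in the raw Load_Type column.
labels = ["Light_Load", "Medium_Load", "Maximum_Load"]

encoder = LabelEncoder()
encoder.fit(labels)

# classes_ holds the sorted labels; each label's integer code is its index there.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'Light_Load': 0, 'Maximum_Load': 1, 'Medium_Load': 2}
```

If an ordering by intensity (light < medium < maximum) were preferred, an explicit dictionary passed to `Series.map` would be one alternative to `LabelEncoder`.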

During preprocessing, `month` and `hour` were extracted from the `date` column. The following redundant columns were removed:  
- `date` and `NSM` (time of day and time of year are captured by `hour` and `month`, at hourly rather than 15-minute resolution)  
- `Day_of_week` (largely redundant with `WeekStatus`)  
- `CO2(tCO2)` (derived from `Usage_kWh`, adding no new information)  

`WeekStatus` was converted to a binary variable (**0 = Weekday, 1 = Weekend**).  

Numerical features were normalized using **MinMaxScaler** so that all values lie within the range [0, 1]. This keeps features with large raw ranges (such as `Usage_kWh`) from dominating the gradient updates and the regularization penalty. The normalized features are:  
- `Usage_kWh`  
- `Lagging_Current_Reactive.Power_kVarh`  
- `Leading_Current_Reactive_Power_kVarh`  
- `Lagging_Current_Power_Factor`  
- `Leading_Current_Power_Factor`  
- `hour`  
- `month`  
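
To make the transformation concrete, the sketch below applies the min-max formula by hand to two values that can be checked against the processed table: hour 1 and the first row's `Usage_kWh` (3.17, with the column minimum 0.0 and maximum 157.18 taken from the original `describe()` output). Variable names are our own.

```python
import numpy as np

# Hours of the day as extracted by `dt.hour` (0 through 23).
hours = np.arange(24, dtype=float)

# Min-max scaling: x' = (x - min) / (max - min)
scaled = (hours - hours.min()) / (hours.max() - hours.min())
print(round(float(scaled[1]), 6))   # 0.043478, the value shown for 01:00 rows

# Same formula for Usage_kWh, using the min and max from the describe() table.
usage_min, usage_max = 0.0, 157.18
usage_scaled = (3.17 - usage_min) / (usage_max - usage_min)
print(round(usage_scaled, 6))       # 0.020168, the first row of the processed table
```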

The final dataset contains 9 columns (8 features plus the target):  
1. **Usage_kWh** – Energy consumed (normalized)  
2. **Lagging_Current_Reactive.Power_kVarh** – Reactive power for lagging current (normalized)  
3. **Leading_Current_Reactive_Power_kVarh** – Reactive power for leading current (normalized)  
4. **Lagging_Current_Power_Factor** – Efficiency measure for lagging current (normalized)  
5. **Leading_Current_Power_Factor** – Efficiency measure for leading current (normalized)  
6. **hour** – Hour of the day (normalized)  
7. **month** – Month of the year (normalized)  
8. **WeekStatus** – Binary indicator of weekday (0) or weekend (1)  
9. **Load_Type** – Target variable (0=Light, 1=Maximum, 2=Medium)  

The `describe()` tables above report the count, mean, standard deviation, quartiles, minimum, and maximum for every variable before and after preprocessing. After scaling, each normalized column has a minimum of 0 and a maximum of 1, which confirms the transformation.  

#### Part Three (0.5 pt)
Divide your data into training and testing splits using an 80% training and 20% testing split. Use the data splitting modules that are part of scikit-learn. Argue "for" or "against" splitting your data using an 80/20 split. That is, why is the 80/20 split appropriate (or not) for your dataset?

```python
from sklearn.model_selection import train_test_split

# Features/target
X = df.drop(columns=['Load_Type'])
y = df['Load_Type']  # 0=Light, 1=Maximum, 2=Medium (LabelEncoder's alphabetical order)

# 80/20 split, stratified to preserve class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.value_counts(normalize=True).sort_index())
print(y_test.value_counts(normalize=True).sort_index())
```

(28032, 8) (7008, 8)
Load_Type
0    0.515732
1    0.207549
2    0.276719
Name: proportion, dtype: float64
Load_Type
0    0.515839
1    0.207477
2    0.276684
Name: proportion, dtype: float64


An 80/20 train-test split is appropriate for this dataset because it provides a strong balance between training and evaluation. With over 35,000 total records, splitting 80% for training yields more than 28,000 samples for the model to learn from, while reserving about 7,000 samples for testing. This ensures that the model has sufficient data to fit patterns in the steel plant’s energy usage while still holding out enough examples to reliably evaluate performance.  

Another benefit of the 80/20 split is that it does not waste data on either side. Using much more data for testing (e.g., a 70/30 split) would shrink the training set without making the accuracy estimate much more precise, while a smaller test set (e.g., 90/10, about 3,500 records) would make per-class metrics noisier, especially for the smallest load class. By applying stratification, the class proportions remain consistent between training and testing, which is especially important since the dataset is not perfectly balanced.  

Overall, the 80/20 split is a widely accepted standard in machine learning that achieves a reasonable compromise. It supports model generalization, provides robust evaluation metrics, and fits well with the size and structure of this dataset.  
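
One way to quantify "enough examples to reliably evaluate" is the binomial standard error of an accuracy estimate. The sketch below compares test fractions of 10%, 20%, and 30% for this dataset's 35,040 rows; the accuracy of 0.90 is an assumed value chosen purely for illustration, not a measured result, and the helper name is our own.

```python
import math

n_total = 35040  # rows in the dataset


def accuracy_standard_error(p, n):
    """Binomial standard error of an accuracy estimate p measured on n samples."""
    return math.sqrt(p * (1.0 - p) / n)


# Assumed accuracy, for illustration only (not a result from this lab).
assumed_accuracy = 0.90

for test_fraction in (0.10, 0.20, 0.30):
    n_test = round(n_total * test_fraction)
    se = accuracy_standard_error(assumed_accuracy, n_test)
    print(f"test={test_fraction:.0%}  n_test={n_test}  SE={se:.4f}")
```

Because the standard error shrinks only with the square root of the test-set size, moving from 20% to 30% buys a small gain in precision at the cost of about 3,500 training rows.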

### Modeling (5 pts)

#### Part One (2 pts)
Create a custom, one-versus-all logistic regression classifier using numpy and scipy to optimize. Use object oriented conventions identical to scikit-learn. You should start with the template developed by the instructor in the course. You should add the following functionality to the logistic regression classifier:
- Ability to choose the optimization technique when the class is instantiated: steepest ascent, stochastic gradient ascent, or {Newton's method/Quasi Newton methods}. It is recommended to call this the "solver" input for the class.
- Update the gradient calculation to include a customizable regularization term (no regularization, L1 regularization, L2 regularization, or both L1 and L2 regularization). Associate a cost with the regularization term, "C", that can be adjusted when the class is instantiated.  
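
A minimal sketch of what such a classifier could look like is given here. It is not the instructor's template: the class names, the `penalty` and `eta` parameters, and the solver strings are our own assumptions. It minimizes the mean negative log-likelihood (equivalent to ascending the likelihood), uses SciPy's BFGS as the quasi-Newton option, and treats `C` as a multiplier on the penalty, which is the opposite of scikit-learn's inverse-strength convention.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit


class BinaryLogisticRegression:
    """Binary logistic regression: mean negative log-likelihood plus a penalty."""

    def __init__(self, solver="steepest", penalty="l2", C=0.01,
                 eta=0.1, iterations=500, seed=0):
        self.solver, self.penalty, self.C = solver, penalty, C
        self.eta, self.iterations = eta, iterations
        self.rng = np.random.default_rng(seed)

    def _penalty_value(self, w):
        value = 0.0  # the bias w[0] is never regularized
        if self.penalty in ("l1", "both"):
            value += self.C * np.sum(np.abs(w[1:]))
        if self.penalty in ("l2", "both"):
            value += self.C * np.sum(w[1:] ** 2)
        return value

    def _penalty_gradient(self, w):
        grad = np.zeros_like(w)
        if self.penalty in ("l1", "both"):
            grad[1:] += self.C * np.sign(w[1:])  # subgradient of |w|
        if self.penalty in ("l2", "both"):
            grad[1:] += 2.0 * self.C * w[1:]
        return grad

    def _loss(self, w, Xb, y):
        p = np.clip(expit(Xb @ w), 1e-12, 1 - 1e-12)
        nll = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
        return nll + self._penalty_value(w)

    def _gradient(self, w, Xb, y):
        return Xb.T @ (expit(Xb @ w) - y) / len(y) + self._penalty_gradient(w)

    def fit(self, X, y):
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend bias column
        w = np.zeros(Xb.shape[1])
        if self.solver == "bfgs":  # quasi-Newton via SciPy
            w = minimize(self._loss, w, args=(Xb, y), jac=self._gradient,
                         method="BFGS", options={"maxiter": self.iterations}).x
        else:
            for _ in range(self.iterations):
                if self.solver == "stochastic":  # one random sample per step
                    i = self.rng.integers(len(y))
                    w -= self.eta * self._gradient(w, Xb[i:i + 1], y[i:i + 1])
                else:  # full-batch steepest descent
                    w -= self.eta * self._gradient(w, Xb, y)
        self.w_ = w
        return self

    def predict_proba(self, X):
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])
        return expit(Xb @ self.w_)


class OneVsAllLogisticRegression:
    """Fits one binary model per class and predicts the most probable class."""

    def __init__(self, **binary_kwargs):
        self.binary_kwargs = binary_kwargs

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = [
            BinaryLogisticRegression(**self.binary_kwargs).fit(X, (y == c).astype(float))
            for c in self.classes_
        ]
        return self

    def predict(self, X):
        probs = np.column_stack([m.predict_proba(X) for m in self.models_])
        return self.classes_[np.argmax(probs, axis=1)]


# Usage on synthetic, well-separated blobs (not the steel data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
y_demo = np.repeat(np.arange(3), 100)
X_demo = centers[y_demo] + rng.normal(scale=0.5, size=(300, 2))

accuracies = {}
for solver in ("steepest", "stochastic", "bfgs"):
    model = OneVsAllLogisticRegression(solver=solver, penalty="l2", C=0.01)
    accuracies[solver] = float(np.mean(model.fit(X_demo, y_demo).predict(X_demo) == y_demo))
    print(f"{solver:10s} training accuracy = {accuracies[solver]:.3f}")
```

The L1 term is handled with a plain subgradient, which is adequate for a sketch but can make BFGS stall near zero weights; a production implementation would use a solver designed for non-smooth penalties.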

#### Part Two (1.5 pts)
Train your classifier to achieve good generalization performance. That is, adjust the optimization technique and the value of the regularization term(s) "C" to achieve the best performance on your test set. Visualize the performance of the classifier versus the parameters you investigated.
Is your method of selecting parameters justified? That is, do you think there is any "data snooping" involved with this method of selecting parameters?
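
Tuning parameters against the test set lets information about that set leak into the model choice. One common way to avoid this is to pick parameters on a separate validation split and touch the test set only once. The sketch below illustrates the idea on synthetic data with scikit-learn's `LogisticRegression` as a stand-in; the candidate values and the 60/20/20 proportions are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic three-class stand-in for the steel data (illustration only).
X, y = make_classification(n_samples=3000, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)

# Hold out the test set first, then carve a validation set out of the remainder.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42)

# Candidate settings (in scikit-learn, C is the INVERSE regularization strength).
candidates = [0.001, 0.01, 0.1, 1.0, 10.0]
val_scores = {
    C: LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train).score(X_val, y_val)
    for C in candidates
}
best_C = max(val_scores, key=val_scores.get)

# The test set is used exactly once, after the choice is fixed.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_trainval, y_trainval)
test_accuracy = final_model.score(X_test, y_test)
print(f"best C = {best_C}, test accuracy = {test_accuracy:.3f}")
```

Cross-validation on the training portion would serve the same purpose with less variance, at the cost of more fits.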

#### Part Three (1.5 pts)
Compare the performance of your "best" logistic regression optimization procedure to the procedure used in scikit-learn. Visualize the performance differences in terms of training time and classification performance. Discuss the results. 
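
A small timing helper is one way to set up this comparison. The sketch below is an assumption about how the measurement could be organized, not the lab's actual results: two scikit-learn solvers stand in for the two implementations, the data is synthetic, and `time_fit` is a hypothetical helper that works with any estimator exposing `fit` and `predict`.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def time_fit(model, X_train, y_train, X_test, y_test):
    """Return (training seconds, test accuracy) for an estimator with fit/predict."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    accuracy = float(np.mean(model.predict(X_test) == y_test))
    return elapsed, accuracy


# Synthetic three-class data as a stand-in (illustration only).
X, y = make_classification(n_samples=5000, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)

# A custom classifier with the same fit/predict interface could be added here.
models = {
    "sklearn lbfgs": LogisticRegression(solver="lbfgs", max_iter=1000),
    "sklearn saga": LogisticRegression(solver="saga", max_iter=1000),
}
results = {name: time_fit(m, X_train, y_train, X_test, y_test)
           for name, m in models.items()}

for name, (seconds, acc) in results.items():
    print(f"{name:15s} train time = {seconds:.4f} s, test accuracy = {acc:.3f}")
```

Single timings are noisy, so repeating each fit several times and reporting the mean and spread would give a fairer comparison.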

### Deployment (1 pt)
Which implementation of logistic regression would you advise be used in a deployed machine learning model, your implementation or scikit-learn (or other third party implementation)? Why?

### Exceptional Work (1 pt)
Option One: Implement an optimization technique for logistic regression using mean square error as your objective function (instead of maximum likelihood). Derive the gradient updates for the Hessian and use Newton's method to update the values of "w". Then answer, which process do you prefer: maximum likelihood OR minimum mean-squared error? 
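
As a starting point for this option, the gradient and Hessian can be sketched as follows. This assumes the objective is the summed squared error between a binary target $y_i \in \{0, 1\}$ and the sigmoid output $g_i$; the notation is our own and is not taken from the course template.

$$
J(\mathbf{w}) = \sum_{i=1}^{N} (y_i - g_i)^2, \qquad g_i = \sigma(\mathbf{w}^\top \mathbf{x}_i), \qquad \frac{\partial g_i}{\partial \mathbf{w}} = g_i (1 - g_i)\, \mathbf{x}_i
$$

$$
\nabla J(\mathbf{w}) = -2 \sum_{i=1}^{N} (y_i - g_i)\, g_i (1 - g_i)\, \mathbf{x}_i
$$

$$
\mathbf{H}(\mathbf{w}) = 2 \sum_{i=1}^{N} \Big[ g_i^2 (1 - g_i)^2 - (y_i - g_i)\, g_i (1 - g_i)(1 - 2 g_i) \Big]\, \mathbf{x}_i \mathbf{x}_i^\top
$$

$$
\mathbf{w} \leftarrow \mathbf{w} - \mathbf{H}(\mathbf{w})^{-1}\, \nabla J(\mathbf{w})
$$

The bracketed factor in the Hessian can be negative, so unlike the maximum-likelihood case the Hessian is not guaranteed to be positive definite, and a Newton step may need damping or a fallback to gradient descent.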