# SGD Regression

Let's dive into **SGDRegressor**, a flexible linear regression model that uses **Stochastic Gradient Descent** to optimize its weights. We'll cover the intuition, the math, practical use, pros and cons, and examples.

---

## 🌟 What is SGDRegressor?

`SGDRegressor` is a **linear model** optimized using **Stochastic Gradient Descent (SGD)**, a fast and scalable optimization algorithm that updates model weights incrementally per sample or mini-batch.

It is part of `sklearn.linear_model`.

---

## 🧠 Intuition Behind Stochastic Gradient Descent

- In **regular gradient descent**, we calculate gradients using the **entire training dataset** → accurate but slow for large data.
- In **SGD**, we update weights using **one training sample at a time** (or a mini-batch) → faster but noisier.
- Over time, these noisy updates converge (with proper tuning) to a **minimum** of the loss function.
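
To make the per-sample update concrete, here is a minimal NumPy sketch of SGD on squared loss. The synthetic data, the constant learning rate `eta`, and the epoch count are all assumptions chosen for illustration; this is not how `SGDRegressor` is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 3*x0 - 2*x1 + 1 + small noise
X = rng.normal(size=(500, 2))
true_w, true_b = np.array([3.0, -2.0]), 1.0
y = X @ true_w + true_b + rng.normal(scale=0.1, size=500)

w, b = np.zeros(2), 0.0
eta = 0.01  # constant learning rate, picked by hand for this toy problem

for epoch in range(20):
    for i in rng.permutation(len(X)):   # visit the samples in random order
        error = (X[i] @ w + b) - y[i]   # prediction error on ONE sample
        w -= eta * error * X[i]         # gradient of 0.5 * error**2 w.r.t. w
        b -= eta * error                # gradient of 0.5 * error**2 w.r.t. b

print("learned weights:", w, "learned bias:", b)
```

Each update looks at a single row, so the path toward the minimum is noisy, but the weights should end up close to the values used to generate the data.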

---

## 🧮 Objective Function

SGDRegressor minimizes a **regularized loss function**:

$$
\text{Objective:} \quad \min_{w} \ \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{y}_i) + \alpha R(w)
$$

Where:
- $ L(y_i, \hat{y}_i) $: Loss function (default: squared loss)
- $ R(w) $: Regularization term (e.g., L1 or L2)
- $ \alpha $: Regularization strength
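
The objective can be evaluated by hand for a small example. The sketch below assumes squared loss and an L2 penalty, each with a factor of 0.5 (the convention described in scikit-learn's SGD user guide); the helper name `regularized_objective` and the toy numbers are made up for illustration.

```python
import numpy as np

def regularized_objective(w, b, X, y, alpha=0.0001):
    """Mean squared-error loss plus alpha times an L2 penalty (hypothetical helper)."""
    y_hat = X @ w + b
    data_loss = np.mean(0.5 * (y - y_hat) ** 2)  # (1/n) * sum of L(y_i, y_hat_i)
    penalty = 0.5 * np.sum(w ** 2)               # R(w) for the L2 case
    return data_loss + alpha * penalty

# Toy inputs where the weights fit the data exactly, so only the penalty remains
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([5.0, 11.0])
w = np.array([1.0, 2.0])

value = regularized_objective(w, b=0.0, X=X, y=y, alpha=0.1)
print(value)  # data loss is 0, penalty is 0.5 * (1 + 4) = 2.5, so 0.1 * 2.5 = 0.25
```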

---

## 🎯 Supported Loss Functions

You can choose the loss function using the `loss` parameter:

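| Loss         | Description                      |
|--------------|----------------------------------|
| `'squared_error'` | Squared loss, as in ordinary least squares |
| `'huber'`     | Huber loss: quadratic for small errors, linear for large ones (robust to outliers) |
| `'epsilon_insensitive'` | Ignores errors smaller than `epsilon`, linear beyond that (the loss used in SVR) |
| `'squared_epsilon_insensitive'` | Same as above, but squared beyond the `epsilon` tolerance |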
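
The four losses can be written out in a few lines of NumPy to see how they treat small and large errors. This is a sketch of the usual textbook definitions, not scikit-learn's source code; the function names and the default `epsilon=0.1` are assumptions for illustration.

```python
import numpy as np

def squared_error(y, y_hat):
    return 0.5 * (y - y_hat) ** 2

def huber(y, y_hat, epsilon=0.1):
    r = np.abs(y - y_hat)
    # quadratic inside the epsilon band, linear outside it
    return np.where(r <= epsilon, 0.5 * r ** 2, epsilon * r - 0.5 * epsilon ** 2)

def epsilon_insensitive(y, y_hat, epsilon=0.1):
    # errors smaller than epsilon cost nothing
    return np.maximum(0.0, np.abs(y - y_hat) - epsilon)

def squared_epsilon_insensitive(y, y_hat, epsilon=0.1):
    return np.maximum(0.0, np.abs(y - y_hat) - epsilon) ** 2

# Compare the losses on a tiny, a moderate, and an outlier-sized error
y_true = np.zeros(3)
y_pred = np.array([0.05, 1.0, 10.0])
for fn in (squared_error, huber, epsilon_insensitive, squared_epsilon_insensitive):
    print(f"{fn.__name__:30s}", fn(y_true, y_pred))
```

The outlier-sized error dominates the squared loss, while the Huber and epsilon-insensitive losses grow only linearly with it.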

---

## 🛡️ Regularization

`SGDRegressor` supports:

| Penalty     | Description       |
|-------------|-------------------|
| `'l2'`      | Ridge             |
| `'l1'`      | Lasso             |
| `'elasticnet'` | L1 + L2 combo  |

So `SGDRegressor` can behave like **Ridge, Lasso, or ElasticNet**, but optimized with SGD.
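
As a rough illustration, the same model can be fit with each penalty and the coefficients compared. The synthetic data (only the first two of ten features carry signal) and the values of `alpha` and `l1_ratio` are assumptions chosen for this sketch.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)

# Synthetic data: only the first two features influence y
X = rng.normal(size=(300, 10))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

coefs = {}
for penalty in ("l2", "l1", "elasticnet"):
    model = SGDRegressor(
        penalty=penalty,
        alpha=0.01,
        l1_ratio=0.5,      # only used when penalty="elasticnet"
        max_iter=1000,
        tol=1e-3,
        random_state=0,
    )
    model.fit(X, y)
    coefs[penalty] = model.coef_
    print(f"{penalty:10s}", np.round(model.coef_, 2))
```

With an L1 or elastic-net penalty, the coefficients of the eight irrelevant features will typically be pushed closer to zero than with L2 alone.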

---

## 🔧 Important Parameters

| Parameter        | Meaning |
|------------------|---------|
| `loss`           | Loss function (`'squared_error'`, `'huber'`, etc.) |
| `penalty`        | Regularization type (`'l2'`, `'l1'`, `'elasticnet'`) |
| `alpha`          | Regularization strength (default: `0.0001`) |
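| `max_iter`       | Maximum number of passes (epochs) over the training data |
| `tol`            | Stopping tolerance: training stops when the loss fails to improve by at least `tol` for `n_iter_no_change` consecutive epochs |
| `learning_rate`  | Learning-rate schedule: `'constant'`, `'optimal'`, `'invscaling'` (default), `'adaptive'` |
| `eta0`           | Initial learning rate for the `'constant'`, `'invscaling'`, and `'adaptive'` schedules (default: `0.01`) |
| `early_stopping` | Hold out part of the training data as a validation set and stop when the validation score stops improving |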
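
To see how `learning_rate` and `eta0` interact, the two simplest schedules can be sketched as a plain function. The helper `learning_rate_at` is hypothetical; the `'invscaling'` formula `eta0 / t**power_t` follows the scikit-learn documentation, and `power_t=0.25` is assumed here as the exponent. The `'optimal'` and `'adaptive'` schedules are left out because they depend on other settings or on training progress.

```python
def learning_rate_at(t, schedule="invscaling", eta0=0.01, power_t=0.25):
    """Step size at update number t (t >= 1) for two simple schedules."""
    if schedule == "constant":
        return eta0
    if schedule == "invscaling":
        return eta0 / (t ** power_t)
    raise ValueError(f"schedule {schedule!r} is not covered by this sketch")

for t in (1, 16, 10_000):
    print(t, learning_rate_at(t, "constant"), learning_rate_at(t, "invscaling"))
```

With `'invscaling'`, the step size starts at `eta0` and shrinks slowly, which helps damp the noise of the per-sample updates late in training.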

---

## ✅ Pros and ❌ Cons

### ✅ Pros:
- **Fast and memory efficient**
- Good for **large-scale data**
- Supports various **losses and penalties**
- Can **stream in batches** (online learning)

### ❌ Cons:
- Requires **careful tuning** of hyperparameters (like learning rate)
- **Sensitive to feature scaling**
- Can **oscillate** if not tuned properly

---

## 🧠 When to Use `SGDRegressor`?

Use it when:
- You have **very large datasets**
- You want to use **online learning** (incremental updates)
- You need **speed over precision**
- You're comfortable tuning hyperparameters
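
For the online-learning case, `SGDRegressor` exposes `partial_fit`, which updates the model on one batch at a time. The sketch below simulates a stream with synthetic mini-batches; the batch size, the number of batches, and the constant learning rate are assumptions for illustration. The simulated features are already on a standard scale, so no scaler is used here.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])  # weights used to simulate the stream

model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=0)

for _ in range(200):
    # Pretend each batch has just arrived and will never be seen again
    X_batch = rng.normal(size=(32, 3))
    y_batch = X_batch @ true_w + rng.normal(scale=0.1, size=32)
    model.partial_fit(X_batch, y_batch)  # one pass over this batch only

print("estimated weights:", np.round(model.coef_, 2))
```

Because only one batch is held in memory at a time, the same loop works when the full dataset is too large to load at once.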

---

## 🧪 Tips for Good Performance

- Always **scale** your features!
- Tune `alpha`, `learning_rate`, and `eta0`
- Try different `loss` functions (`'huber'` is good for noisy data)
- Enable `early_stopping=True` to hold out a fraction of the training data (`validation_fraction`) and stop once the validation score stops improving
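
These tips can be combined in a single pipeline. The sketch below uses synthetic features on very different scales; the data and all hyperparameter values are assumptions for illustration and would need tuning on a real dataset.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Three features whose scales differ by orders of magnitude
X = rng.normal(size=(1000, 3)) * np.array([1.0, 100.0, 0.01])
y = X @ np.array([2.0, 0.05, 300.0]) + rng.normal(scale=0.5, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = make_pipeline(
    StandardScaler(),                 # scale features before SGD sees them
    SGDRegressor(
        alpha=1e-4,
        learning_rate="invscaling",
        eta0=0.01,
        early_stopping=True,          # holds out part of the training data
        validation_fraction=0.1,
        n_iter_no_change=5,
        max_iter=1000,
        tol=1e-3,
        random_state=42,
    ),
)
model.fit(X_train, y_train)

r2 = model.score(X_test, y_test)
print(f"R^2 on the test split: {r2:.3f}")
```

Putting the scaler inside the pipeline means it is fit on the training split only, so no information from the test split leaks into preprocessing.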

In [1]:
# import necessary libraries 
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

%matplotlib inline

In [2]:
# load the dataset
car_data = pd.read_csv('../Data/CarPrice_Assignment.csv')
car_data.head()

| | car_ID | symboling | CarName | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | ... | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | alfa-romero giulia | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495.0 |
| 1 | 2 | 3 | alfa-romero stelvio | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500.0 |
| 2 | 3 | 1 | alfa-romero Quadrifoglio | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500.0 |
| 3 | 4 | 2 | audi 100 ls | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.4 | 10.0 | 102 | 5500 | 24 | 30 | 13950.0 |
| 4 | 5 | 2 | audi 100ls | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.4 | 8.0 | 115 | 5500 | 18 | 22 | 17450.0 |


In [3]:
car_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   car_ID            205 non-null    int64  
 1   symboling         205 non-null    int64  
 2   CarName           205 non-null    object 
 3   fueltype          205 non-null    object 
 4   aspiration        205 non-null    object 
 5   doornumber        205 non-null    object 
 6   carbody           205 non-null    object 
 7   drivewheel        205 non-null    object 
 8   enginelocation    205 non-null    object 
 9   wheelbase         205 non-null    float64
 10  carlength         205 non-null    float64
 11  carwidth          205 non-null    float64
 12  carheight         205 non-null    float64
 13  curbweight        205 non-null    int64  
 14  enginetype        205 non-null    object 
 15  cylindernumber    205 non-null    object 
 16  enginesize        205 non-null    int64  
 1

In [6]:
# count the unique values in each categorical (object-typed) column
unique_values = {}

category_cols = car_data.select_dtypes(include=['O']).columns.tolist()
for col in category_cols:
    unique_values[col] = car_data[col].nunique()
    
unique_values


{'CarName': 147,
 'fueltype': 2,
 'aspiration': 2,
 'doornumber': 2,
 'carbody': 5,
 'drivewheel': 3,
 'enginelocation': 2,
 'enginetype': 7,
 'cylindernumber': 7,
 'fuelsystem': 8}

Except for `CarName`, every categorical feature has fewer than 10 unique values. These features describe properties of the car and can help predict its price. `CarName`, on the other hand, has 147 unique values across only 205 rows, so it is unlikely to be a useful predictor of price.

In [None]:
# symboling appears to be a categorical variable encoded as integers
car_data['symboling'].astype('O').nunique()

6

In [13]:
# convert numerical to object
car_data['symboling'] = car_data['symboling'].astype('O')
car_data['symboling'].dtype

dtype('O')

In [15]:
# let us create the final dataset to be used for prediction
car_data.drop(columns=['car_ID', 'CarName'], inplace=True)
car_data.columns

Index(['symboling', 'fueltype', 'aspiration', 'doornumber', 'carbody',
       'drivewheel', 'enginelocation', 'wheelbase', 'carlength', 'carwidth',
       'carheight', 'curbweight', 'enginetype', 'cylindernumber', 'enginesize',
       'fuelsystem', 'boreratio', 'stroke', 'compressionratio', 'horsepower',
       'peakrpm', 'citympg', 'highwaympg', 'price'],
      dtype='object')

In [23]:
# let us now list the categories of each categorical column

unique_cat = {}

category_cols = car_data.select_dtypes(include=['O']).columns.tolist()
for col in category_cols:
    unique_cat[col] = {
        'nunique': car_data[col].nunique(),
        'unique values': [cat for cat in car_data[col].unique()]
        }
    
unique_cat

{'symboling': {'nunique': 6, 'unique values': [3, 1, 2, 0, -1, -2]},
 'fueltype': {'nunique': 2, 'unique values': ['gas', 'diesel']},
 'aspiration': {'nunique': 2, 'unique values': ['std', 'turbo']},
 'doornumber': {'nunique': 2, 'unique values': ['two', 'four']},
 'carbody': {'nunique': 5,
  'unique values': ['convertible', 'hatchback', 'sedan', 'wagon', 'hardtop']},
 'drivewheel': {'nunique': 3, 'unique values': ['rwd', 'fwd', '4wd']},
 'enginelocation': {'nunique': 2, 'unique values': ['front', 'rear']},
 'enginetype': {'nunique': 7,
  'unique values': ['dohc', 'ohcv', 'ohc', 'l', 'rotor', 'ohcf', 'dohcv']},
 'cylindernumber': {'nunique': 7,
  'unique values': ['four', 'six', 'five', 'three', 'twelve', 'two', 'eight']},
 'fuelsystem': {'nunique': 8,
  'unique values': ['mpfi',
   '2bbl',
   'mfi',
   '1bbl',
   'spfi',
   '4bbl',
   'idi',
   'spdi']}}

In [24]:
# let us separate out the numeric cols 
numeric_cols = car_data.select_dtypes(include=['number']).columns.tolist()
numeric_cols

['wheelbase',
 'carlength',
 'carwidth',
 'carheight',
 'curbweight',
 'enginesize',
 'boreratio',
 'stroke',
 'compressionratio',
 'horsepower',
 'peakrpm',
 'citympg',
 'highwaympg',
 'price']
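
One possible next step, sketched here as an assumption rather than as the notebook's actual method, is to feed the categorical and numeric column lists into a `ColumnTransformer` that one-hot encodes the former and scales the latter before `SGDRegressor`. The small `toy` frame below contains made-up placeholder rows that only borrow a few column names from the dataset; it is not taken from `CarPrice_Assignment.csv`. Note that `price` is the target, so it is removed from the numeric feature list.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up placeholder rows, NOT real data from the car price CSV
toy = pd.DataFrame({
    "fueltype": ["gas", "gas", "diesel", "gas", "diesel", "gas"],
    "carbody": ["sedan", "hatchback", "sedan", "wagon", "sedan", "hatchback"],
    "enginesize": [100, 120, 140, 110, 150, 95],
    "horsepower": [70, 95, 110, 85, 120, 65],
    "price": [8000.0, 11000.0, 15000.0, 9500.0, 17000.0, 7000.0],
})

target = "price"
# Numeric features exclude the target; everything else is treated as categorical
numeric_cols = toy.select_dtypes(include=["number"]).columns.drop(target).tolist()
category_cols = [c for c in toy.columns if c not in numeric_cols + [target]]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),                         # scale numeric features
    ("cat", OneHotEncoder(handle_unknown="ignore"), category_cols),  # one-hot encode categories
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("sgd", SGDRegressor(max_iter=1000, tol=1e-3, random_state=0)),
])

X_toy, y_toy = toy.drop(columns=[target]), toy[target]
pipeline.fit(X_toy, y_toy)
predictions = pipeline.predict(X_toy)
print(predictions.round(0))
```

On the real data, the same structure would apply with the full `category_cols` and `numeric_cols` lists, plus a train/test split before fitting.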