# Asteroid Diameter Prediction with Linear Regression (SGD)

**Project Goal:** Develop a linear regression model that predicts asteroid diameter from orbital and physical features. The Stochastic Gradient Descent (SGD) optimizer is implemented from scratch.

## Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

## Data Loading and Preparation

In this section, we load, inspect, and prepare the data for modeling.

### Load and Split Data

We load the dataset and split it into training and test sets as required by the project (80/20 split, `random_state=42`).

In [None]:
df = pd.read_csv('NASA_JPL_Dataset.csv')

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
df.describe().T

In [None]:
X = df.drop(columns=['diameter'])  # Features
y = df['diameter']  # Target variable

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)

### Preprocessing

#### Remove irrelevant columns

In [None]:
# # Define the list of irrelevant columns
# cols_to_drop = [
#     "column_name",
# ]

# X_train.drop(columns=cols_to_drop, inplace=True)
# X_test.drop(columns=cols_to_drop, inplace=True)

In [None]:
X_train.columns

#### EDA and Visualization (optional)

In [None]:
# Example: correlation matrix of the numeric columns
numeric_df = df.select_dtypes(include='number')
plt.figure(figsize=(20, 6))
corr = numeric_df.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap for Numerical Features')
plt.show()

#### Feature Engineering (optional)

In [None]:
# Add new features (e.g., given features x, y, and z, add products x^i * y^j * z^k with i + j + k >= 2).
# Warning: extra polynomial terms can cause the model to overfit.
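
The polynomial terms described above can be generated as in the following sketch. It is a minimal illustration on a toy array, not the notebook's chosen feature set: the helper name `add_polynomial_terms` and the `degree` parameter are hypothetical, and it appends every monomial of total degree 2 up to `degree` to the original columns.

```python
from itertools import combinations_with_replacement

import numpy as np


def add_polynomial_terms(X, degree=2):
    """Append all monomials of total degree 2..degree to the columns of X.

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]
    columns = [X]
    for d in range(2, degree + 1):
        # Each index tuple such as (0, 0) or (0, 2) picks the factors of one monomial
        for idx in combinations_with_replacement(range(n_features), d):
            columns.append(np.prod(X[:, list(idx)], axis=1, keepdims=True))
    return np.hstack(columns)


# Toy example with three features x, y, z
X_toy = np.array([[1.0, 2.0, 3.0],
                  [2.0, 0.0, 1.0]])
X_poly = add_polynomial_terms(X_toy, degree=2)
print(X_poly.shape)  # 3 original columns + 6 degree-2 terms
```

The number of added columns grows quickly with the number of features and the degree, which is one reason such terms can overfit.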

#### Encoding

Categorical columns must be converted to numeric format.

In [None]:
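categorical = X_train.select_dtypes(include='object').columns
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)  # LabelEncoder is meant for targets
X_train[categorical] = encoder.fit_transform(X_train[categorical])
X_test[categorical] = encoder.transform(X_test[categorical])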
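
To make the idea concrete, here is a small from-scratch sketch of ordinal encoding. It assumes the mapping is learned on the training values only and that unseen test categories get a sentinel code of `-1`. The helper names and the toy category strings are made up for illustration. scikit-learn's `OrdinalEncoder` with `handle_unknown='use_encoded_value'` offers comparable behavior.

```python
import numpy as np


def fit_ordinal(train_values):
    """Map each category seen in training to an integer code (hypothetical helper)."""
    return {cat: code for code, cat in enumerate(sorted(set(train_values)))}


def apply_ordinal(values, mapping, unknown=-1):
    """Encode values with a fitted mapping; unseen categories get `unknown`."""
    return np.array([mapping.get(v, unknown) for v in values])


# Toy category labels, invented for this example
train_classes = ["MBA", "APO", "MBA", "AMO"]
test_classes = ["APO", "MBA", "TNO"]  # "TNO" never appears in training

mapping = fit_ordinal(train_classes)
train_codes = apply_ordinal(train_classes, mapping)
test_codes = apply_ordinal(test_classes, mapping)
print(mapping)
print(test_codes)
```

Integer codes impose an arbitrary order on the categories, so for a linear model one-hot encoding may be the better choice when a column has only a few distinct values.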

#### Outlier Detection (optional)

In [None]:
# Drop some outlier instances based on std and percentile.
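
One possible way to combine the standard-deviation and percentile criteria is sketched below. The helper `inlier_mask` and its thresholds (`z_max=3.0`, 1st and 99th percentiles) are assumptions chosen for illustration and should be tuned to the data.

```python
import numpy as np


def inlier_mask(values, z_max=3.0, lower_pct=1.0, upper_pct=99.0):
    """Return True for entries that pass both a z-score and a percentile check.

    Hypothetical helper; the default thresholds are illustrative only.
    """
    values = np.asarray(values, dtype=float)
    z_scores = np.abs(values - values.mean()) / values.std()
    low, high = np.percentile(values, [lower_pct, upper_pct])
    return (z_scores <= z_max) & (values >= low) & (values <= high)


# Toy data: 200 well-behaved values plus one extreme value
rng = np.random.default_rng(0)
toy = np.concatenate([rng.normal(10.0, 1.0, size=200), [1000.0]])
mask = inlier_mask(toy)
print(f"kept {mask.sum()} of {mask.size} values")
```

If used in this notebook, such a mask would be computed on the training data only (for example from `y_train.to_numpy()`) and applied to both `X_train` and `y_train`, leaving the test set untouched.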

#### Feature Scaling

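SGD converges reliably only when the features share a comparable scale, so all numeric features are standardized. The scaler should be fit on the training set only and then applied to the test set.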

In [None]:
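numeric_columns = X_train.select_dtypes(include='number').columns  # CSV floats load as float64, not float32
scaler = StandardScaler()
X_train[numeric_columns] = scaler.fit_transform(X_train[numeric_columns])  # fit on training data only
X_test[numeric_columns] = scaler.transform(X_test[numeric_columns])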
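
For reference, standardization subtracts the per-column mean and divides by the per-column standard deviation, with both statistics taken from the training set. The sketch below shows this on a toy array. The helper names are hypothetical, and it uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np


def fit_standardizer(X_train):
    """Compute per-column mean and standard deviation from the training data."""
    X_train = np.asarray(X_train, dtype=float)
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0.0] = 1.0  # avoid division by zero for constant columns
    return mean, std


def standardize(X, mean, std):
    """Apply previously fitted statistics to any array with the same columns."""
    return (np.asarray(X, dtype=float) - mean) / std


train_toy = np.array([[1.0, 100.0], [3.0, 300.0], [5.0, 500.0]])
test_toy = np.array([[3.0, 100.0]])

mean, std = fit_standardizer(train_toy)
train_scaled = standardize(train_toy, mean, std)
test_scaled = standardize(test_toy, mean, std)  # reuses the training statistics
print(train_scaled)
print(test_scaled)
```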

In [None]:
X_train

In [None]:
X_test

## Model Implementation and Training (From Scratch)

Here, we build the core of the project: the `SGDRegressorScratch` class. This class will implement the linear regression model and the Stochastic Gradient Descent optimizer from scratch.
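
As a reference point, one common formulation of the per-sample update is given below. It assumes a squared-error loss and a separate bias term $b$; the symbols $\mathbf{w}$ (weights), $\eta$ (learning rate), and $(\mathbf{x}_i, y_i)$ (one training sample) are notation introduced here, not taken from the project statement.

$$
\hat{y}_i = \mathbf{w}^\top \mathbf{x}_i + b, \qquad
L_i = \tfrac{1}{2}\,(\hat{y}_i - y_i)^2
$$

$$
\mathbf{w} \leftarrow \mathbf{w} - \eta\,(\hat{y}_i - y_i)\,\mathbf{x}_i, \qquad
b \leftarrow b - \eta\,(\hat{y}_i - y_i)
$$

One epoch applies this update once for every training sample, visiting the samples in a freshly shuffled order.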

To improve the score and meet the bonus target, techniques such as momentum, learning-rate scheduling, or regularization can be added to the `SGDRegressorScratch` class.
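
As a rough illustration of how these three additions fit together, the standalone sketch below trains a linear model with classical momentum, an inverse-time learning-rate decay, and an L2 penalty on the weights. The function name `sgd_momentum_fit` and all default hyperparameters are assumptions for this example, not values prescribed by the project.

```python
import numpy as np


def sgd_momentum_fit(X, y, learning_rate=0.01, n_epochs=50, momentum=0.9,
                     decay=0.01, l2=0.0, random_state=42):
    """Per-sample SGD with momentum, learning-rate decay, and L2 regularization.

    Hypothetical helper; defaults are illustrative, not tuned.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(random_state)
    n_samples, n_features = X.shape
    weights, bias = np.zeros(n_features), 0.0
    v_w, v_b = np.zeros(n_features), 0.0  # velocity terms for momentum

    for epoch in range(n_epochs):
        lr = learning_rate / (1.0 + decay * epoch)  # inverse-time decay schedule
        for i in rng.permutation(n_samples):
            error = X[i] @ weights + bias - y[i]
            grad_w = error * X[i] + l2 * weights  # L2 penalty on weights only
            grad_b = error
            v_w = momentum * v_w - lr * grad_w
            v_b = momentum * v_b - lr * grad_b
            weights = weights + v_w
            bias = bias + v_b
    return weights, bias


# Noise-free toy problem: y = 3*x0 - 2*x1 + 5
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 5.0
weights, bias = sgd_momentum_fit(X_toy, y_toy)
print(weights, bias)
```

With `l2=0.0` the fitted parameters should approach the generating coefficients. A positive `l2` shrinks the weights toward zero, trading a little bias for lower variance.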

In [None]:
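class SGDRegressorScratch:
    """
    Implementation of the SGD Regression model from scratch.

    A minimal sketch: plain per-sample SGD on squared error, without
    momentum, learning-rate scheduling, or regularization.

    Parameters:
    -----------
    learning_rate (float): The learning rate for weight updates.
    n_epochs (int): The number of passes over the entire dataset.
    random_state (int): Ensures reproducible results for data shuffling.
    """
    def __init__(self, learning_rate, n_epochs, random_state):
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs
        self.random_state = random_state
        self.weights_ = None
        self.bias_ = 0.0  # intercept, kept separate from the weights

    def fit(self, X, y):
        """
        Fit the model to the training data.

        Parameters:
        -----------
        X (array-like): Feature matrix of shape (n_samples, n_features).
        y (array-like): Target vector of shape (n_samples,).
        """
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        n_samples, n_features = X.shape
        rng = np.random.default_rng(self.random_state)
        self.weights_ = np.zeros(n_features)
        self.bias_ = 0.0

        for _ in range(self.n_epochs):
            # Visit the samples in a new random order each epoch
            for i in rng.permutation(n_samples):
                error = X[i] @ self.weights_ + self.bias_ - y[i]
                # Gradient of 0.5 * error**2 with respect to weights and bias
                self.weights_ -= self.learning_rate * error * X[i]
                self.bias_ -= self.learning_rate * error

        return self

    def predict(self, X):
        """
        Predict values for new data.

        Parameters:
        -----------
        X (array-like): Feature matrix of shape (n_samples, n_features).

        Returns:
        -------
        array: Predicted values.
        """
        X = np.asarray(X, dtype=float)
        return X @ self.weights_ + self.bias_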

In [None]:
n_epochs = 100
learning_rate = 0.1
random_state = 42
sgd_scratch_model = SGDRegressorScratch(
    n_epochs=n_epochs,
    learning_rate=learning_rate,
    random_state=random_state
)

sgd_scratch_model.fit(X_train, y_train.values)

y_pred_scratch = sgd_scratch_model.predict(X_test)

r2 = r2_score(y_test, y_pred_scratch)
mae = mean_absolute_error(y_test, y_pred_scratch)
mse = mean_squared_error(y_test, y_pred_scratch)

print(f"Scratch SGD R²: {r2:.4f}")
print(f"Scratch SGD MAE: {mae:.4f}")
print(f"Scratch SGD MSE: {mse:.4f}")
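
For readers who want to see what the three reported metrics measure, the sketch below recomputes them with NumPy on a tiny hand-made example. The helper `regression_metrics` is hypothetical. Its formulas follow the standard definitions that `r2_score`, `mean_absolute_error`, and `mean_squared_error` use for a single target with default arguments.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute R², MAE, and MSE from their definitions (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    mae = np.mean(np.abs(residuals))
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"r2": r2, "mae": mae, "mse": mse}


toy = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(toy)  # r2 = 0.9, mae = 0.25, mse = 0.125
```

An R² of 1 means perfect predictions, 0 means no better than always predicting the mean of `y_true`, and negative values mean worse than that baseline.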
