# Linear Regression

In [None]:
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Data Preprocessing

### **Exploring the dataset**

Let's start by loading the training data from the CSV file into a pandas DataFrame.



Load the datasets from GitHub. The training dataset has already been loaded for you into `df` below. To get the test dataset, use the code provided in the testing section.

In [2]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/cronan03/DevSoc_AI-ML/main/train_processed_splitted.csv')

Let's see what the first 5 rows of this dataset look like.

In [None]:
df.head()

What are all the features present? What is the range for each of the features along with their mean?

In [None]:
# List of all columns (features)
print("Features present:\n", df.columns.tolist())

# Compute summary statistics for all features
df.describe()


### **Feature Scaling and One-Hot Encoding**

You must have noticed that some features (such as `Utilities`) are not continuous values.

These features contain values indicating different categories and must be converted to numbers before the model can use them, because a linear model operates on numbers and not on strings.

These features are called categorical features. We can represent them with a `One-Hot Representation`: each category gets its own column, which is 1 when the datapoint belongs to that category and 0 otherwise.
  
  
You must have also noticed that the other features are each on a different scale. This can be detrimental to the performance of our linear regression model, so we normalize them so that all of them are in the range $[0,1]$.

> NOTE: When you are doing feature scaling, store the min/max values that you use for normalization. They will be needed again at testing time. Try to think about why we are doing this.

In [None]:
# Do the one-hot encoding here
# This finds the 'Utilities' column and replaces it with new columns for each category
# e.g., 'Utilities_AllPub', 'Utilities_NoSeWa' (with drop_first=True, the first of these columns is dropped)
# We set drop_first=True to avoid multicollinearity, which is important for linear models
# We start from the raw dataframe `df`; dtype=int gives 0/1 columns instead of booleans
df_processed = pd.get_dummies(df, columns=['Utilities'], drop_first=True, dtype=int)

# You can now see the new numerical columns instead of 'Utilities'
print("Data after One-Hot Encoding:")
print(df_processed.head())


In [None]:
# Do the feature scaling here
# 1. First, identify all numerical columns. We'll exclude IDs and the target variable 'SalePrice'
# as you don't want to scale them.
numerical_cols = df_processed.select_dtypes(include=['int64', 'float64']).columns
cols_to_scale = numerical_cols.drop(['Id', 'SalePrice']) # Don't scale the ID or the target

# 2. Initialize the scaler (MinMaxScaler comes from scikit-learn)
# This object is what you will "store"
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

# 3. Fit and Transform the training data
# .fit() calculates the min and max for each column
# .transform() applies the (X - min) / (max - min) formula
df_processed[cols_to_scale] = scaler.fit_transform(df_processed[cols_to_scale])

# Now all your numerical features are between 0 and 1
print("\nData after Feature Scaling (showing scaled columns):")
print(df_processed[cols_to_scale].describe())
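
The note about storing the min/max can be illustrated with a small NumPy sketch. The arrays and the helper `min_max_scale` below are hypothetical and only meant for illustration; they are not part of the dataset or of scikit-learn. The point is that the statistics come from the training data alone and are then reused unchanged on the test data, so test values may land slightly outside $[0,1]$.

```python
import numpy as np

# Toy data (hypothetical values): 2 features on very different scales
X_train_toy = np.array([[1000.0, 2.0], [2000.0, 3.0], [4000.0, 6.0]])
X_test_toy = np.array([[3000.0, 4.0], [5000.0, 1.0]])

# Compute and store the statistics from the TRAINING data only
col_min = X_train_toy.min(axis=0)
col_max = X_train_toy.max(axis=0)

def min_max_scale(values, col_min, col_max):
    """Apply (x - min) / (max - min) column-wise with the given statistics."""
    return (values - col_min) / (col_max - col_min)

X_train_scaled = min_max_scale(X_train_toy, col_min, col_max)
# Reuse the stored training statistics; do not recompute them on the test set
X_test_scaled = min_max_scale(X_test_toy, col_min, col_max)

print(X_train_scaled)
print(X_test_scaled)
```

If the test set were scaled with its own min/max, the same raw value would map to different scaled values in training and testing, and the learnt weights would no longer apply to it.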

### **Conversion to NumPy**

Now that we have preprocessed all the data, we need to convert it to NumPy for our linear regression model.

Assume that our dataset has a total of $N$ datapoints, each with a total of $D$ features (after one-hot encoding). We want our NumPy array of features to be of shape $(N, D)$.

In our task, we have to predict the `SalePrice`. We will need 2 numpy arrays $(X, Y)$. These represent the features and targets respectively.

In [None]:
# Convert to numpy array
import numpy as np

# 1. Create the Y (target) numpy array
# We select the 'SalePrice' column and use .values to get its numpy representation
Y = df_processed['SalePrice'].values

# 2. Create the X (features) numpy array
# We drop the target ('SalePrice') and the 'Id' column, then get the numpy representation
X = df_processed.drop(columns=['SalePrice', 'Id']).values

# 3. Check the shapes (N = datapoints, D = features)
N = X.shape[0]
D = X.shape[1]

print(f"Shape of X (features): {X.shape}")
print(f"Shape of Y (target): {Y.shape}")
print(f"Total datapoints (N): {N}")
print(f"Total features (D): {D}")



## Linear Regression formulation
  
We now have our data in the form we need. Let's create a linear model to get our initial (really bad) predictions.


Let's say a single datapoint in our dataset consists of 3 features $(x_1, x_2, x_3)$, we can pose it as a linear equation as follows:
$$ y = w_1x_1 + w_2x_2 + w_3x_3 + b $$
Here we have to learn 4 parameters $(w_1, w_2, w_3, b)$
  
  
Now how do we extend this to multiple datapoints?  
  
  
Try to answer the following:
- How many parameters will we have to learn in the case of our dataset? (Don't forget the bias term)
- Form a linear equation for our dataset. We need just a single matrix equation which correctly represents all the datapoints in our dataset
- Implement the linear equation as an equation using NumPy arrays (Start by randomly initializing the weights from a standard normal distribution)
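
One possible way to write the matrix equation is sketched below. The symbols $\hat{Y}$ (the vector of predictions) and $\mathbf{1}_N$ (a column of $N$ ones) are notation introduced here for illustration; $N$ and $D$ are as defined earlier.

$$ \hat{Y} = XW + b\,\mathbf{1}_N, \qquad X \in \mathbb{R}^{N \times D},\; W \in \mathbb{R}^{D \times 1},\; b \in \mathbb{R},\; \hat{Y} \in \mathbb{R}^{N \times 1} $$

Row $i$ of this equation is the single-datapoint equation $\hat{y}_i = w_1x_{i1} + \dots + w_Dx_{iD} + b$, so under this formulation there are $D + 1$ parameters to learn.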

In [None]:
import numpy as np

# We assume X and Y are already defined from the previous step.
# For this example, let's create placeholders if they don't exist.
# (You can skip this part if you are running this in the same notebook as the previous step)
try:
    X.shape
    Y.shape
except NameError:
    print("X and Y not found, creating dummy data for demonstration.")
    N_dummy, D_dummy = 100, 10 # 100 datapoints, 10 features
    X = np.random.rand(N_dummy, D_dummy)
    Y = np.random.rand(N_dummy)

# --- Start of the required implementation ---

# Set a random seed for reproducibility (so we all get the same "random" weights)
np.random.seed(42)

# Get the dimensions of our data
N, D = X.shape

# 1. Initialize the parameters
# W = weights, initialized from a standard normal distribution
# The shape must be (D, 1)
W = np.random.randn(D, 1)

# b = bias, initialized to zero (a common practice)
b = np.zeros(1)

# 2. Implement the linear equation to get our (bad) initial predictions
# We use the '@' operator for matrix multiplication (np.dot(X, W) also works)
Y_pred = (X @ W) + b

# 3. Check the results
print(f"Shape of X: {X.shape} (N, D)")
print(f"Shape of W: {W.shape} (D, 1)")
print(f"Shape of b: {b.shape} (a single bias value, broadcast over all rows)")
print("---")
print(f"Shape of Y_pred: {Y_pred.shape} (N, 1)")
print(f"\nFirst 5 predictions:\n{Y_pred[:5]}")

How well does our model perform? Try comparing our predictions with the actual values

In [None]:
N = Y.shape[0]

# 1. Calculate the errors (residuals)
# Y is (N,) and Y_pred is (N, 1). We need to reshape Y to (N, 1) to subtract them.
# .reshape(-1, 1) tells numpy to make it a column vector with N rows.
errors = Y.reshape(-1, 1) - Y_pred

# 2. Square the errors
squared_errors = errors**2

# 3. Calculate the mean (the average)
mse = np.mean(squared_errors)

print(f"Our initial, random model's Mean Squared Error (MSE) is: {mse}")

### **Learning weights using gradient descent**

These results are really bad. We need to update our weights so that the model correctly represents our data. How do we do that?

We must do the following:
- Define a numerical measure of our performance. For this we use a Loss Function ( $\mathscr{L}$ )
- Find the gradients of the `Loss` with respect to the `Weights`
- Update the weights in accordance with the gradients: $W = W - \alpha\nabla_W \mathscr{L}$, where $\alpha$ is the learning rate

Let's define the loss function:
- We will use the MSE loss since it is a regression task. (Specify the assumptions we make while doing so as taught in the class).
- Implement this loss as a function. (Use numpy as much as possible)

In [None]:
def mse_loss_fn(y_true, y_pred):
    """Mean squared error between targets and predictions (arrays of the same shape)."""
    return np.mean((y_true - y_pred) ** 2)

Calculate the gradients of the loss with respect to the weights (and biases). First write the equations down on a piece of paper, then proceed to implement it
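
As a reference for the pen-and-paper step, here is one way the derivation can go. It assumes the MSE loss is defined as a plain mean over the $N$ datapoints (without an extra factor of $\tfrac{1}{2}$) and that predictions are $\hat{Y} = XW + b$; $\hat{y}_i$ and $y_i$ denote the prediction and target for datapoint $i$.

$$ \mathscr{L} = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2 $$

$$ \nabla_W \mathscr{L} = \frac{2}{N}X^\top(\hat{Y} - Y), \qquad \frac{\partial \mathscr{L}}{\partial b} = \frac{2}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i) $$

Note that $\nabla_W \mathscr{L}$ has the same shape as $W$, namely $(D, 1)$, which is a quick check to run on any implementation.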

In [None]:
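def get_gradients(y_true, y_pred, W, b, X):
    """
    Calculates the gradients for the MSE loss function with respect to the weights (and bias)

    Args:
        y_true: The true values of the target variable (SalePrice in our case), shape (N, 1)
        y_pred: The predicted values of the target variable using our model (X @ W + b), shape (N, 1)
        W: The weights of the model
        b: The bias of the model
        X: The input features

    Returns:
        dW: The gradients of the loss function with respect to the weights
        db: The gradients of the loss function with respect to the bias
    """
    n = X.shape[0]
    # Residuals of the current predictions
    error = y_pred - y_true
    # For L = mean((y_pred - y_true)^2): dL/dW = (2/N) X^T error, dL/db = (2/N) sum(error)
    dW = (2.0 / n) * (X.T @ error)
    db = (2.0 / n) * np.sum(error)
    return dW, db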

Update the weights using the gradients

In [None]:
def update(weights, bias, gradients_weights, gradients_bias, lr):
    """
    Updates the weights (and bias) using the gradients and the learning rate

    Args:
        weights: The current weights of the model
        bias: The current bias of the model
        gradients_weights: The gradients of the loss function with respect to the weights
        gradients_bias: The gradients of the loss function with respect to the bias
        lr: The learning rate

    Returns:
        weights_new: The updated weights of the model
        bias_new: The updated bias of the model
    """
    # Step against the gradient, scaled by the learning rate
    weights_new = weights - lr * gradients_weights
    bias_new = bias - lr * gradients_bias
    return weights_new, bias_new

Put all of these together in a loop that computes the loss, computes its gradients, and finally updates the weights. Feel free to play around with different learning rates and numbers of epochs.

> NOTE: The code in the comments is just meant to be used as a guide. You will have to make changes based on your own code.

In [None]:
NUM_EPOCHS = 1000
LEARNING_RATE = 2e-2

losses = []

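# Targets as a column vector so that their shape matches the (N, 1) predictions
y = Y.reshape(-1, 1)

for epoch in range(NUM_EPOCHS):
    y_pred = X @ W + b  # forward pass, shape (N, 1)
    loss = mse_loss_fn(y, y_pred)
    losses.append(loss)
    dW, db = get_gradients(y, y_pred, W, b, X)
    W, b = update(W, b, dW, db, LEARNING_RATE)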


Now use matplotlib to plot the loss graph

In [None]:
plt.plot(losses)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.show()
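
If the loss curve looks suspicious, a sanity check on synthetic data can help. The sketch below is self-contained and does not use the housing data: all names ending in `_toy`, the chosen "true" parameters, and the learning rate are assumptions made up for illustration. It generates noiseless targets from known weights and checks that gradient descent on the MSE loss recovers them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated from known parameters (hypothetical values)
N_toy, D_toy = 200, 3
X_toy = rng.random((N_toy, D_toy))             # features in [0, 1)
true_W = np.array([[2.0], [-1.0], [0.5]])
true_b = 0.3
Y_toy = X_toy @ true_W + true_b                # noiseless targets, shape (N, 1)

# Random initial weights, zero bias
W_toy = rng.standard_normal((D_toy, 1))
b_toy = 0.0
alpha = 0.1                                    # learning rate

toy_losses = []
for _ in range(5000):
    residual = (X_toy @ W_toy + b_toy) - Y_toy
    toy_losses.append(np.mean(residual ** 2))          # MSE loss
    grad_W = (2.0 / N_toy) * (X_toy.T @ residual)      # gradient w.r.t. weights
    grad_b = (2.0 / N_toy) * residual.sum()            # gradient w.r.t. bias
    W_toy -= alpha * grad_W                            # gradient descent update
    b_toy -= alpha * grad_b

print("Recovered weights:", W_toy.ravel())
print("Recovered bias:", b_toy)
print("Final loss:", toy_losses[-1])
```

Because the toy targets contain no noise, the recovered parameters should end up very close to `true_W` and `true_b`. On real data the loss will not go to zero, but it should still decrease steadily for a suitably small learning rate.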

### **Testing with test data**

Load the test data and apply the same preprocessing steps that were used for the training data. Remember to use the **SAME** min/max values that you used for the training set; do not recalculate them from the test set. Also explain why we are doing this.

To load test data from GitHub, use the code below.


In [None]:
df_test = pd.read_csv('https://raw.githubusercontent.com/cronan03/DevSoc_AI-ML/main/test_processed_splitted.csv')
print(df_test)

# Let's find all the columns that are missing in the test set
missing_cols = set(df.columns) - set(df_test.columns)

# Add these columns to the test set with all zeros
for col in missing_cols:
    df_test[col] = 0

if 'Utilities_AllPub' not in df_test.columns:
    df_test = df_test.join(pd.get_dummies(df_test['Utilities'], dtype = 'int32', prefix = 'Utilities'))
    df_test = df_test.drop('Utilities', axis = 1)



Using the weights learnt above, predict the values in the test dataset. Also answer the following questions:
- Are the predictions good?
- What is the MSE loss for the test set?
- Is the MSE loss for testing greater or lower than for training?
- Why is this the case?

In [None]:
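# Fill NaN values
df_test.fillna(0, inplace=True)

# Align the test columns with the training columns (same set, same order);
# columns that are absent from the test set are filled with zeros
df_test = df_test.reindex(columns=df_processed.columns, fill_value=0)

# Scale features with the scaler fitted on the training data (do NOT refit it here)
df_test[cols_to_scale] = scaler.transform(df_test[cols_to_scale])

# Check for unexpected NaNs
print("NaNs remaining:", df_test.isna().sum().sum())

# Convert to numpy array (drop the same columns as for the training features)
x_test = df_test.drop(columns=['SalePrice', 'Id']).to_numpy(dtype=float) # (N, D)
y_test = df_test['SalePrice'].to_numpy().reshape(-1, 1) # (N, 1)
print(x_test.shape)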


In [None]:
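# Compare against the processed training dataframe, since that is what the model was trained on
extra_cols = list(set(df_test.columns) - set(df_processed.columns))
print("Extra columns in df_test:", extra_cols)

missing_cols = list(set(df_processed.columns) - set(df_test.columns))
print("Missing columns in df_test:", missing_cols)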

In [None]:
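# Make predictions
y_pred_test = x_test @ W + b # (N, 1)
loss_test = mse_loss_fn(y_test, y_pred_test)

# Scale the predictions back to the original scale
# SalePrice was excluded from scaling during training, so predictions and targets
# are already in the original units; the names below are kept for the next cell
y_pred_test_scaled = y_pred_test
y_test_scaled = y_test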

In [None]:
idx = np.random.randint(0, x_test.shape[0], 5)
y_pred_test_sample = y_pred_test_scaled[idx].round().astype(int)
y_true_test_sample = y_test_scaled[idx].round().astype(int)

print('Predicted SalePrice: \t', y_pred_test_sample.squeeze().tolist())
print('Actual SalePrice: \t', y_true_test_sample.squeeze().tolist())
print('\nTest Loss: \t\t', loss_test)