# Linear Regression

1. The ordinary least squares (OLS) method is very sensitive to outliers. Because the error is squared, a point far from the line contributes disproportionately to the loss, so the fitted OLS line deviates from the majority of the data points. We can clearly see that the line is pulled towards the outliers.
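
The effect described in answer 1 can be reproduced with a small sketch. The data are hypothetical (ten points on the line y = 2x + 1, with one value corrupted), and `ols_fit` is an illustrative helper rather than part of the assignment code.

```python
import numpy as np

# Hypothetical toy data: points exactly on y = 2x + 1.
x = np.arange(10, dtype=float)
y_clean = 2.0 * x + 1.0

# Same data with a single gross outlier at the last point.
y_outlier = y_clean.copy()
y_outlier[-1] += 50.0

def ols_fit(x, y):
    """Return (slope, intercept) of the least-squares line."""
    A = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[0], coef[1]

slope_clean, intercept_clean = ols_fit(x, y_clean)
slope_out, intercept_out = ols_fit(x, y_outlier)
print(f"clean data:   slope={slope_clean:.3f}, intercept={intercept_clean:.3f}")
print(f"with outlier: slope={slope_out:.3f}, intercept={intercept_out:.3f}")
```

A single corrupted point out of ten is enough to move the slope well away from 2, because its squared residual dominates the loss.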

2. Scheme 1 is better. We need to reduce the impact of outliers on the loss. 
Therefore the loss terms of outliers should be multiplied by a scalar significantly smaller than the one used for inliers, so using $a_i = 0.01$ for outliers and $a_i = 1$ for inliers is preferred.  
On the other hand, scheme 2 multiplies the outlier terms by a larger value, which boosts their impact on the loss function and makes the fit even more sensitive to outliers.
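
As a rough illustration of answer 2, the sketch below minimizes the weighted loss $\sum_i a_i (y_i - \hat{y}_i)^2$ on hypothetical toy data. The weight 0.01 for the outlier follows scheme 1; the weight 5 used for scheme 2 is only an assumed stand-in for "a larger value", and `weighted_fit` is an illustrative helper.

```python
import numpy as np

# Hypothetical toy data: y = 2x + 1 with one outlier at the last point.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
y[-1] += 50.0
is_outlier = np.zeros(10, dtype=bool)
is_outlier[-1] = True

def weighted_fit(x, y, a):
    """Minimize sum_i a_i * (y_i - (w*x_i + b))^2 and return (w, b)."""
    A = np.column_stack([x, np.ones_like(x)])
    sw = np.sqrt(a)  # scaling each row by sqrt(a_i) gives the weighted problem
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[0], coef[1]

slope_plain, _ = weighted_fit(x, y, np.ones(10))
slope_s1, _ = weighted_fit(x, y, np.where(is_outlier, 0.01, 1.0))  # scheme 1
slope_s2, _ = weighted_fit(x, y, np.where(is_outlier, 5.0, 1.0))   # assumed scheme 2
print(f"unweighted: {slope_plain:.3f}")
print(f"scheme 1:   {slope_s1:.3f}")
print(f"scheme 2:   {slope_s2:.3f}")
```

Down-weighting the outlier brings the slope back close to the true value of 2, while up-weighting it moves the line even further away than the unweighted fit.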

3. 

4. The primary goal is to identify the regions of the brain that are predictive. The interest lies in regions, not in individual voxels. Therefore Group Lasso is more appropriate in this situation, where the features have a pre-defined group structure.  
Standard lasso tries to identify individually predictive voxels, and the result is a sparse set of voxels scattered across all regions. It would then be difficult to conclude that a region is predictive when only a handful of its voxels have non-zero weights.  
With Group Lasso, a region is either kept as a whole or dropped as a whole. This is exactly what we need: the model contains only the regions that are predictive as a whole, and non-predictive regions are removed from the model.  
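
The "whole region or nothing" behaviour comes from the group penalty $\lambda \sum_g \lVert w_g \rVert_2$. A minimal sketch of its proximal step (block soft-thresholding) is shown below. The weights, the region labels and the helper name `group_soft_threshold` are hypothetical, and the common $\sqrt{\text{group size}}$ factor is left out for simplicity.

```python
import numpy as np

def group_soft_threshold(w, groups, threshold):
    """Shrink each group's weight vector; zero it if its L2 norm <= threshold."""
    out = np.zeros_like(w)
    for g in np.unique(groups):
        idx = groups == g
        norm = np.linalg.norm(w[idx])
        if norm > threshold:
            out[idx] = (1.0 - threshold / norm) * w[idx]
    return out

# Hypothetical weights for 6 voxels belonging to 2 brain regions.
w = np.array([3.0, 4.0, 0.0, 0.1, -0.2, 0.2])
regions = np.array([0, 0, 0, 1, 1, 1])

w_new = group_soft_threshold(w, regions, threshold=1.0)
print(w_new)
```

Region 0 has a large joint norm and is kept (shrunk), while every voxel of region 1 is set to zero together, which is the region-level selection the answer relies on.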

# Logistic Regression

In [18]:
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the penguins dataset
df = sns.load_dataset("penguins")
df.dropna(inplace=True)

# Filter rows for 'Adelie' and 'Chinstrap' classes
selected_classes = ['Adelie', 'Chinstrap']
# Make a copy to avoid the warning
df_filtered = df[df['species'].isin(selected_classes)].copy()

# Initialize the LabelEncoder
le = LabelEncoder()

# Encode the species column
y_encoded = le.fit_transform(df_filtered['species'])
df_filtered['class_encoded'] = y_encoded

# Display the filtered and encoded DataFrame
print(df_filtered[['species', 'class_encoded']])

# Split the data into features (X) and target variable (y)
y = df_filtered['class_encoded']  # Target variable
X = df_filtered.drop(['class_encoded'], axis=1)

       species  class_encoded
0       Adelie              0
1       Adelie              0
2       Adelie              0
4       Adelie              0
5       Adelie              0
..         ...            ...
215  Chinstrap              1
216  Chinstrap              1
217  Chinstrap              1
218  Chinstrap              1
219  Chinstrap              1

[214 rows x 2 columns]


2. It raised a `ValueError`:    
    ```ValueError: could not convert string to float: 'Adelie'```  
This happens because some categorical columns contain string values instead of numerical values. To resolve this, we need to convert these categories into numerical values in a meaningful way.  
We can use integer encoding, but then the model might interpret the values in the wrong way. For example, if we use 1, 2, 3 for the species, the model might assume an ordering in which 3 is somehow more than 1 and 2.  
So the best choice is one-hot encoding. With one-hot encoding, each category becomes a new feature that is 1 if the data point belongs to that category and 0 otherwise.  
However, looking at the dataset, we have 3 categorical fields: species, island and sex. Here both species and sex have only 2 classes, so a binary (0/1) encoding is enough for them. The species is also what we are trying to predict, so we drop the species field from X, apply binary encoding to sex and one-hot encoding to island.

In [19]:
X.drop(['species'], axis=1, inplace=True)

le_sex = LabelEncoder()
X['sex'] = le_sex.fit_transform(X['sex'])
X = pd.get_dummies(X, drop_first=True)

X.head()


   bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex  island_Dream  island_Torgersen
0            39.1           18.7              181.0       3750.0    1         False              True
1            39.5           17.4              186.0       3800.0    0         False              True
2            40.3           18.0              195.0       3250.0    0         False              True
4            36.7           19.3              193.0       3450.0    0         False              True
5            39.3           20.6              190.0       3650.0    1         False              True


In [48]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the logistic regression model. Here we are using the saga
# solver to learn the weights.
logreg = LogisticRegression(solver='saga')
logreg.fit(X_train, y_train)

# Predict on the testing data
y_pred = logreg.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy :", accuracy)
print(logreg.coef_, logreg.intercept_)

Accuracy : 0.5813953488372093
[[ 2.76753245e-03 -8.14776377e-05  4.90136556e-04 -2.88108827e-04
   1.10826588e-05  1.85825877e-04 -1.05453230e-04]] [-8.39126135e-06]




3. SAGA uses stochastic optimization. Since our dataset is small, the stochastic nature introduces more variance and instability, leading to worse results and slower convergence. The warning message raised while fitting also states that the model failed to converge within the maximum number of iterations.

In [30]:
logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train, y_train)

# Predict on the testing data
y_pred = logreg.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy :", accuracy)
print(logreg.coef_, logreg.intercept_)

Accuracy : 1.0
[[ 1.5152457  -1.39159164 -0.14412318 -0.00365549 -0.22642547  0.73456302
  -0.56189275]] [-0.07740334]


4. Liblinear has a classification accuracy of 1 (100%).

5. First of all, we have a small dataset. Liblinear's deterministic approach is more stable and converges reliably on it. Our dataset must also have low variance. There is no randomness in the optimization, so the results are consistent.

6. - The saga solver uses stochastic optimization, which means it updates model weights using random subsets of data. This introduces randomness in the training process. 
   - We have a small dataset. With a small dataset, the effect of randomness is amplified. Small changes in the train/test split can lead to significant differences in which samples are used for training and testing.

7. Let's apply standard scaling to bill length, bill depth, flipper length and body mass and see the performance. 

In [49]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Numerical columns to standardize (zero mean, unit variance)
cols = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']

X[cols] = scaler.fit_transform(X[cols])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [50]:
logreg = LogisticRegression(solver='saga')
logreg.fit(X_train, y_train)

# Predict on the testing data
y_pred = logreg.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy :", accuracy)
print(logreg.coef_, logreg.intercept_)

Accuracy : 1.0
[[ 3.46414296 -0.56732773  0.36066423 -0.50455708 -1.10198737  1.39696348
  -0.73051649]] [-2.0494785]


In [51]:
logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train, y_train)

# Predict on the testing data
y_pred = logreg.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy :", accuracy)
print(logreg.coef_, logreg.intercept_)

Accuracy : 1.0
[[ 3.45311312 -0.48395813  0.3741515  -0.43662111 -1.42545299  0.88332086
  -0.9827446 ]] [-1.29148352]


After scaling, both models reach 100% accuracy. 
As a gradient-based solver, saga is sensitive to the scale of the features. If features have very different scales, the optimization can be slow or may not converge well. Standard scaling removed the large differences in feature scale that had prevented saga from converging.
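
One way to combine scaling with saga is to put both steps in a scikit-learn pipeline, so that the scaler is fitted on the training split only (unlike the cell above, which scales all of X before splitting). This is a sketch on hypothetical synthetic data with two features on very different scales, not on the penguins dataset; `max_iter=1000` is an assumed setting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one feature in "millimetres", one in "grams".
rng = np.random.default_rng(0)
n = 200
length_mm = np.concatenate([rng.normal(39, 2.5, n), rng.normal(49, 3.0, n)])
mass_g = np.concatenate([rng.normal(3700, 450, n), rng.normal(3730, 380, n)])
X_demo = np.column_stack([length_mm, mass_g])
y_demo = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits StandardScaler on X_tr only, then trains saga on the
# scaled features; the same scaling is reused when predicting on X_te.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="saga", max_iter=1000),
)
model.fit(X_tr, y_tr)
acc_scaled = accuracy_score(y_te, model.predict(X_te))
print("Accuracy with scaling:", acc_scaled)
```

Keeping the scaler inside the pipeline also prevents statistics of the test split from leaking into training.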

8. That approach is not correct. Whatever we do in data preprocessing must be meaningful. When integer encoding is used for a feature without an ordinal relationship, the values of those integers carry no meaning. For example, if we encode green as 2 and red as 3, then even though 3 > 2, it does not mean that red is somehow more than green. So applying a scaler to this kind of categorical data is pointless.  
The real problem here is the encoding method. What I propose is to use one-hot encoding instead of integer encoding, since this feature does not have any ordinal relationship.
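
The difference between the two encodings can be seen on a tiny example. The `colour` column below is hypothetical, and the integer codes are whatever pandas assigns alphabetically, not the 2 and 3 used in the text.

```python
import pandas as pd

# Hypothetical nominal feature with no natural order.
colours = pd.DataFrame({"colour": ["green", "red", "blue", "red"]})

# Integer encoding: categories are numbered alphabetically
# (blue=0, green=1, red=2), which suggests an order that does not exist.
int_codes = colours["colour"].astype("category").cat.codes

# One-hot encoding: one indicator column per category, no order implied.
one_hot = pd.get_dummies(colours["colour"], prefix="colour")

print(int_codes.tolist())
print(one_hot)
```

Each row of the one-hot frame has exactly one active indicator, so no category is numerically larger than another and there is nothing for a scaler to distort.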

# Logistic Regression First/Second-Order Methods

In [70]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate synthetic data
np.random.seed(0)
centers = [[-5, 0], [5, 1.5]]
X, y = make_blobs(n_samples=2000, centers=centers, random_state=5)
transformation = [[0.5, 0.5], [-0.5, 1.5]]
X = np.dot(X, transformation)


In [71]:
y=y.reshape(-1,1)

2. The weights were initialized to zeros. For logistic regression, the loss function is convex, so it has no local minima other than the global one. Initializing with zeros is a simple and computationally efficient starting point, and with a suitable learning rate gradient descent moves towards the global minimum from any starting point.
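
A small sketch of why the starting point matters little for a convex loss: gradient descent is run from a zero vector and from a random vector and ends at the same weights. The data are hypothetical overlapping 1-D classes (so a finite minimizer exists), and the names `Xd`, `yd`, `loss` and `gd` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical overlapping classes: one feature plus a bias column.
x = np.concatenate([rng.normal(-1, 1, 100), rng.normal(1, 1, 100)])
Xd = np.column_stack([x, np.ones_like(x)])
yd = np.concatenate([np.zeros(100), np.ones(100)])

def loss(w):
    """Mean cross-entropy in a numerically stable form: log(1 + e^z) - y*z."""
    z = Xd @ w
    return np.mean(np.logaddexp(0.0, z) - yd * z)

def gd(w0, lr=0.5, steps=5000):
    """Plain batch gradient descent from the starting point w0."""
    w = w0.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Xd @ w)))
        w -= lr * Xd.T @ (p - yd) / len(yd)
    return w

w_zero = gd(np.zeros(2))                 # zero initialization
w_rand = gd(rng.normal(0, 2, size=2))    # random initialization
print(w_zero, loss(w_zero))
print(w_rand, loss(w_rand))
```

Zero initialization is therefore a convenience rather than a requirement. On perfectly separable data the loss has no finite minimizer, so the weights keep growing instead of settling.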

In [72]:
# --- Helper Functions for Logistic Regression ---

def sigmoid(z):
    """Computes the sigmoid function."""
    return 1 / (1 + np.exp(-z))

def compute_loss(y, y_pred):
    """Computes the binary cross-entropy loss."""
    m = len(y)
    return -1/m * np.sum(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

# --- BGD Implementation ---

def batch_gradient_descent(X, y, iterations=20, learning_rate=0.1):
    m, n = X.shape
    weights = np.zeros((n, 1)) # Initialize weights to zero
    loss_history = []

    for i in range(iterations):
        # 1. Make predictions
        y_pred = sigmoid(X @ weights)
        
        # 2. Calculate loss
        loss = compute_loss(y, y_pred)
        loss_history.append(loss)

        # 3. Calculate gradient
        gradient = (1/m) * X.T @ (y_pred - y)
        
        # 4. Update weights
        weights -= learning_rate * gradient
        
    return weights, loss_history

# Run BGD on the first dataset
weights, loss_history = batch_gradient_descent(X, y, iterations=20, learning_rate=0.1)
print(f"Final BGD Loss: {loss_history[-1]:.4f}")

Final BGD Loss: 0.0499


3. For the loss function, binary cross-entropy (also known as the negative log-likelihood) was selected. This is the standard loss function for binary classification problems. It heavily penalizes predictions that are both confident and incorrect, which effectively guides the model toward a better solution.
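
To make the penalty concrete, the sketch below evaluates the loss for a single positive example at three predicted probabilities. The helper `bce` is a hypothetical variant of `compute_loss` that additionally clips the predictions (an assumption, not part of the original code) so that a prediction of exactly 0 or 1 does not produce `log(0)`.

```python
import numpy as np

def bce(y, p, eps=1e-12):
    """Mean binary cross-entropy with predictions clipped away from 0 and 1."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y_true = np.array([1.0])
loss_confident_right = bce(y_true, np.array([0.99]))  # about 0.01
loss_unsure = bce(y_true, np.array([0.50]))           # about 0.69
loss_confident_wrong = bce(y_true, np.array([0.01]))  # about 4.61

print(loss_confident_right, loss_unsure, loss_confident_wrong)
```

The confident but wrong prediction costs several times more than the unsure one, which is the behaviour the answer refers to.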

Newton's method is a second-order optimization algorithm that uses the Hessian matrix (the matrix of second partial derivatives) to converge more quickly than first-order methods like gradient descent. The weight update rule is: $$\theta = \theta - H^{-1} \nabla J(\theta)$$
where $H$ is the Hessian matrix of $J(\theta)$.
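
For the mean binary cross-entropy loss used here, the two quantities in the update have a closed form. Writing $\hat{y} = \sigma(X\theta)$ for the vector of predictions on the $m$ samples (notation assumed to match the code, where `weights` plays the role of $\theta$):

$$\nabla J(\theta) = \frac{1}{m} X^\top (\hat{y} - y), \qquad H = \frac{1}{m} X^\top W X, \qquad W = \mathrm{diag}\big(\hat{y}_i (1 - \hat{y}_i)\big)$$

The implementation below additionally scales the Newton step by a factor `alpha`, i.e. it applies $\theta = \theta - \alpha H^{-1} \nabla J(\theta)$, which reduces to the rule above for $\alpha = 1$.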

In [80]:
# --- Newton's Method Implementation ---

def newtons_method(X, y, n_iterations=20, alpha=0.5):
    m, n = X.shape
    weights = np.zeros((n, 1)) # Initialize weights to zero
    loss_history = []

    for i in range(n_iterations):
        # 1. Make predictions
        y_pred = sigmoid(X @ weights)

        # 2. Calculate loss
        loss = compute_loss(y, y_pred)
        loss_history.append(loss)

        # 3. Calculate gradient
        gradient = (1/m) * X.T @ (y_pred - y)
        
        # 4. Calculate Hessian
        W = np.diag((y_pred * (1 - y_pred)).ravel())
        H = (1/m) * X.T @ W @ X

        # 5. Update weights (solve H d = gradient instead of forming the inverse)
        weights -= alpha * np.linalg.solve(H, gradient)

    return weights, loss_history

# Run Newton's method on the first dataset
nm_weights, nm_loss_history = newtons_method(X, y, n_iterations=20)
print(f"Final Newton's Method Loss: {nm_loss_history[-1]:.4f}")

Final Newton's Method Loss: 0.0001
