# Ridge Classifier

Ridge Classifier is a type of linear classifier that uses L2 regularization to prevent overfitting. It is particularly useful when the data has multicollinearity, as it adds a penalty to the coefficients, shrinking them towards zero. This helps in improving the generalization of the model. You can check the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html#sklearn.linear_model.RidgeClassifier) for more details.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# imports and path setup
import os
import sys
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))

import numpy as np
import tqdm
from joblib import Parallel, delayed
from sklearn.utils import shuffle
from sklearn.linear_model import RidgeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

from userkits.features import *
from userkits.utils import *

## Load and Shuffle Data

In [3]:
HALF_SIZE = False

In [4]:
# load data from train and eval directories
# set half=True to resize images to half to reduce memory usage
X, y = load_train_data(data_dir='../train_data', half=HALF_SIZE)
X, y = shuffle(X, y, random_state=42)

Loading train data: 100%|██████████| 29/29 [00:22<00:00,  1.26it/s]


## Transform Data and Add Features

The steps to include new features are detailed in (the file). You can find the definitions of currently included features there.

In [5]:
def extract_features(images):
    features_list = []
    def process_image(img):
        feats = [brightness, shannon_entropy]
        # add feature functions here
        feats.extend(color_histogram(img))
        feats.extend(lbp_texture_features(img))
        feats.extend(find_mean(img))
        feats.extend(find_stddev(img))
        return feats

    features_list = Parallel(n_jobs=-1)(delayed(process_image)(img) for img in tqdm.tqdm(images, desc="Extracting features"))
    return np.array(features_list)

In [6]:
X_features = extract_features(X)
X_features.shape

Extracting features: 100%|██████████| 1483/1483 [00:30<00:00, 47.94it/s]


(1483, 530)

In [7]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

## Common Hyperparameters
- `alpha`: Controls the strength of the L2 penalty. A higher value increases regularization, reducing the model's complexity, while a lower value allows the model to fit the data more closely.  
- `fit_intercept`: determines whether to calculate the intercept for this model. If set to `False`, no intercept will be used in calculations.

In [9]:
# Split the data and train the model
X_train, X_test, y_train, y_test = train_test_split(X_features, y_encoded, test_size=0.2)  # you can change test_size
clf = RidgeClassifier()  # you can tune hyperparameters here
clf.fit(X_train, y_train)
print("Train Accuracy:", clf.score(X_train, y_train))
print("Test Accuracy:", clf.score(X_test, y_test))

TypeError: float() argument must be a string or a real number, not 'function'

## Evaluate

In [10]:
# load eval data
# set half=True to resize images to half to reduce memory usage
X_eval, file_ids = load_eval_data("../eval_data", half=HALF_SIZE) 

Loading eval data: 100%|██████████| 1486/1486 [00:24<00:00, 60.51it/s]


In [11]:
X_eval_features = extract_features(X_eval)
eval_predictions = clf.predict(X_eval_features)
print(eval_predictions[:5])

Extracting features: 100%|██████████| 1486/1486 [00:33<00:00, 44.04it/s]


NotFittedError: This RidgeClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

In [None]:
try:
    preds = label_encoder.inverse_transform(eval_predictions)
except Exception:
    preds = eval_predictions

save_predictions(preds, file_ids, output_file='../output/ridge_predictions.csv')