# Build and train logistic regression
## Purpose
In this example we will demonstrate how to:

   - Build a Coreset tree for logistic regression on a train dataset.
   - Retrieve the root Coreset from the Coreset tree using get_coreset and train a logistic regression model on it.
   - Compare the model's quality to a model built on the entire dataset.
   - Add additional data to the Coreset tree through partial_build.
   - Train a model directly on the Coreset tree using the fit function.
   - Compare the model's quality again to a model built on the entire dataset.

In this example we'll be using the well-known Covertype Dataset (https://archive.ics.uci.edu/ml/datasets/covertype). We will split the data into three parts:
   - train_1 - 50% of the data
   - train_2 - 20% of the data
   - test - 30% of the data
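
A 50% / 20% / 30% split like the one listed above can be produced with two chained calls to train_test_split. The following is a minimal sketch on synthetic stand-in data, not the notebook's own cell; the names X_demo, y_demo, X_train_1, X_train_2 and X_test_demo are assumptions introduced for illustration. Passing integer sizes avoids rounding surprises when the fractions are applied to a remainder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Covertype features and labels (hypothetical data).
rng = np.random.default_rng(42)
n_total = 1000
X_demo = rng.normal(size=(n_total, 5))
y_demo = rng.integers(0, 3, size=n_total)

# Absolute sizes for the test part (30%) and the train_2 part (20%).
n_test = round(0.30 * n_total)
n_train_2 = round(0.20 * n_total)

# Step 1: hold out the test set, stratified by label.
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=n_test, random_state=42, stratify=y_demo
)

# Step 2: split the remainder into train_1 (50% of all rows) and train_2 (20%).
X_train_1, X_train_2, y_train_1, y_train_2 = train_test_split(
    X_rest, y_rest, test_size=n_train_2, random_state=42, stratify=y_rest
)

print(len(X_train_1), len(X_train_2), len(X_test_demo))
```

In this layout train_1 would be used for the initial build and train_2 for a later incremental update.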

In [1]:
import os
import warnings
import numpy as np
import pandas as pd
from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_covtype
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.metrics import roc_auc_score, f1_score, balanced_accuracy_score
import xgboost as xgb
from dataheroes import CoresetTreeServiceDTC


## Prepare datasets

In [2]:
X, y = fetch_covtype(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print(f'Dimensions of the training data: {X_train.shape}')


Dimensions of the training data: (464809, 54)


In [3]:
# transform the training and test data
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)

In [4]:
X_train.shape[0]

464809

### Training with the Full Dataset

In [5]:
full_dataset_model = xgb.XGBClassifier(random_state=42)
full_dataset_model.fit(X_train, y_train)
y_pred_full = full_dataset_model.predict(X_test)
n_samples_full = X_train.shape[0]

full_balanced = balanced_accuracy_score(y_test, y_pred_full)

print(f'Balanced Accuracy Score: {full_balanced}')


0.8296036929211656 0.9867050962430965 0.8710532430315913
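
The output above shows three numbers, while the cell only prints the balanced accuracy. The sketch below shows one way to compute a balanced accuracy, a multi-class AUC and an F1 score together with scikit-learn. It runs on synthetic data with a logistic regression model, and the averaging choices (one-vs-rest AUC, weighted F1) are assumptions; the notebook does not state which variants produced its numbers.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Small synthetic 3-class problem (hypothetical stand-in for Covertype).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)          # hard labels for accuracy-style metrics
y_proba = model.predict_proba(X_te)   # class probabilities for AUC

metrics = {
    "balanced_accuracy": balanced_accuracy_score(y_te, y_pred),
    # Assumption: one-vs-rest averaging for the multi-class AUC.
    "auc_ovr": roc_auc_score(y_te, y_proba, multi_class="ovr"),
    # Assumption: F1 averaged over classes, weighted by class support.
    "f1_weighted": f1_score(y_te, y_pred, average="weighted"),
}

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```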


In [103]:
%%timeit
# Time a logistic regression fit on the full training set
full_dataset_model = LogisticRegression().fit(X_train, y_train)


25.2 s ± 1.79 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Training with a Coreset

In [9]:
# Build the coreset tree
service_obj = CoresetTreeServiceDTC(
    optimized_for='training',
    n_classes=7,
    n_instances=X_train.shape[0],
)
service_obj.build(X_train, y_train)

<dataheroes.services.coreset_tree.dtc.CoresetTreeServiceDTC at 0x2119332c640>

In [18]:
# Ignore convergence warnings for logistic regression
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)

# Get the coreset at tree level 5
coreset = service_obj.get_coreset(level=5)
indices, X_train_coreset, y_train_coreset = coreset['data']
w = coreset['w']

# Train an XGBoost classifier on the coreset, passing the coreset weights as sample weights
coreset_model = xgb.XGBClassifier(random_state=42).fit(X_train_coreset, y_train_coreset, sample_weight=w)
y_pred_coreset = coreset_model.predict(X_test)
n_samples_coreset = y_train_coreset.shape[0]

# Evaluate the model (full-dataset balanced accuracy to compare against: 0.8296036929211656)
coreset_score = balanced_accuracy_score(y_test, y_pred_coreset)

print(f"Coreset balanced score ({n_samples_coreset:,} samples): {coreset_score}")



Coreset balanced score (31,018 samples): 0.8195187814437122
Coreset AUC score (31,018 samples): 0.978281428112
Coreset f1-score (31,018 samples): 0.8371900897567188
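
The coreset is trained with sample_weight=w, so each selected row stands in for more than one original row. As a rough intuition for what a sample weight does, the sketch below fits a scikit-learn logistic regression twice on synthetic data: once with integer weights and once with every row physically repeated that many times. Both fits minimize the same objective, so their coefficients should nearly coincide. The integer weights and all variable names are assumptions for illustration only; coreset weights are generally not integers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_small = rng.normal(size=(200, 3))
# Noisy binary labels so the classes are not perfectly separable.
y_small = (X_small[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
w_small = rng.integers(1, 4, size=200)  # integer weights in {1, 2, 3}

# Fit once with per-row weights (tight tolerance so both fits converge fully).
weighted = LogisticRegression(tol=1e-10, max_iter=5000).fit(
    X_small, y_small, sample_weight=w_small
)

# Fit again with each row repeated according to its weight.
X_rep = np.repeat(X_small, w_small, axis=0)
y_rep = np.repeat(y_small, w_small)
repeated = LogisticRegression(tol=1e-10, max_iter=5000).fit(X_rep, y_rep)

# Largest absolute difference between the two coefficient vectors.
max_coef_gap = float(np.max(np.abs(weighted.coef_ - repeated.coef_)))
print(f"Largest coefficient gap: {max_coef_gap:.2e}")
```

The weighted fit sees 200 rows instead of the repeated dataset's several hundred, which is the same trade a coreset makes at a much larger scale.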


In [102]:
%%timeit
coreset_model = LogisticRegression().fit(X_train_coreset, y_train_coreset, sample_weight=w)


3.39 s ± 326 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Training with a Sample

In [17]:
import random
random.seed(42)


sample_length = 68856

# Draw a random sample of row indices without replacement
random_indices = random.sample(range(X_train.shape[0]), sample_length)

# Select the sampled rows from both arrays
X_train_sample = X_train[random_indices]
y_train_sample = y_train[random_indices]

# Train the model on the sample
sample_model = xgb.XGBClassifier(random_state=42).fit(X_train_sample, y_train_sample)

# Evaluate the model
sample_balanced = balanced_accuracy_score(y_test, sample_model.predict(X_test))

print(f"sample balanced ({sample_length:,} samples): {sample_balanced}")


sample balanced (68,856 samples): 0.7843310695650707
sample AUC score (68,856 samples): 0.9820023882077724
sample f1 score (68,856 samples): 0.8565871793327194
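
The uniform random baseline can also be drawn with NumPy alone. The sketch below is an alternative under stated assumptions, not the cell that produced the scores above: it uses synthetic arrays and hypothetical names, and a NumPy generator seeded with 42 selects different rows than random.sample with the same seed, so scores would not match exactly.

```python
import numpy as np

# Synthetic stand-in arrays (hypothetical data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 4))
y_demo = rng.integers(0, 3, size=1000)
sample_length_demo = 100

# Draw row indices uniformly at random, without replacement.
sample_idx = rng.choice(X_demo.shape[0], size=sample_length_demo, replace=False)

# Index both arrays with the same indices so features and labels stay aligned.
X_sample_demo = X_demo[sample_idx]
y_sample_demo = y_demo[sample_idx]

print(X_sample_demo.shape, y_sample_demo.shape)
```

For a like-for-like comparison with a coreset, the sample size would typically be set to the number of rows in the coreset.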


In [19]:
%%timeit 

sample_model = xgb.XGBClassifier(random_state=42).fit(X_train_sample, y_train_sample)
