<a href="https://colab.research.google.com/github/adellabr/Model_quality_estimation/blob/main/ML3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **1. Answer to the questions**

##### **1. What is leave-one-out? Provide limitations and strong sides**

**Leave-one-out** - это простая кросс-валидация. Каждый обучающий набор создается путем взятия всех образцов, кроме одного, а тестовый набор - это оставленный образец. Выборка разбивается на n − 1 и 1 объект n раз.
* Преимущества LOO в том, что каждый объект ровно один раз участвует в контроле, а длина обучающих подвыборок лишь на единицу меньше длины полной выборки, поэтому метод хорошо подходит для небольших выборок.
* Недостатком LOO является большая ресурсоёмкость, так как обучаться приходится L раз.

##### **2. How Grid Search, Randomized Grid Search and Bayesian optimization works?**

* Grid Search –  метод предполагает задание наборов значений для каждого гиперпараметра, которые затем "перебираются" систематически для нахождения наилучшей комбинации. Для каждой комбинации гиперпараметров производится обучение модели и оценка её производительности на валидационных данных.
* Randomized Grid Search - это метод, который выбирает случайные комбинации параметров из заданного пространства параметров и оценивает их, чтобы найти наиболее оптимальное решение.
* Bayesian optimization - на каждой итерации метод указывает, в какой следующей точке мы с наибольшей вероятностью улучшим нашу текущую оценку оптимума

##### **3. Explain the classification of feature selection methods. Explain how Pearson and Chi2 works. Explain how Lasso works. Explain what is permutation importance. Get acquainted with SHAP**

**Supervised**
* **Методы фильтрации** (filter methods) основаны на статистических методах и, как правило, рассматривают каждую фичу независимо. Позволяют оценить и ранжировать фичи по значимости, за которую принимается степень корреляции этой фичи с целевой переменной.
* **Wrapper methods.** Суть этой категории методов в том, что классификатор запускается на разных подмножествах фич исходного тренировочного сета. После чего выбирается подмножество фич с наилучшими параметрами на обучающей выборке.
 - Есть два подхода в этом классе методов — методы включения (forward selection) и исключения (backwards selection) фич.
* **Встроенные методы (embedded methods)** позволяют не разделять отбор фич и обучение классификатора, а производят отбор внутри процесса расчета модели. Основным методом из этой категории является регуляризация.

**Unsupervised**
* Zero or near-zero variance
* Many missing values
* High multicollinearity



**Критерий хи-квадрат (Chi-squared score)**: Проверяет, есть ли значимая разница между наблюдаемой и ожидаемой частотами двух категориальных переменных. Таким образом, проверяется нулевая гипотеза об отсутствии связи между двумя переменными. Используется для категориальных признаков в датасете. Мы вычисляем хи-квадрат между каждым признаком и целью, после выбираем желаемое количество “фич” с лучшими показателями.

**Метод Пирсона** - это корреляционный метод, который измеряет линейную зависимость между двумя переменными. Чем ближе значение коэффициента корреляции (от -1 до 1) к 1 (или -1), тем сильнее зависимость.

**Вычисление важности признаков (Permutation feature importance)** ​​включает в себя случайное перемешивание значений одного признака и наблюдение за результирующим ухудшением оценки модели.

**SHAP (SHapley Additive exPlanations)** - это метод интерпретации машинного обучения, который объясняет вклад каждого признака в предсказание конкретного наблюдения.

## **2. Preprocessing**

In [627]:
import pandas as pd
import numpy as np
from google.colab import drive
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
import numbers
from sklearn.utils import indexable
from sklearn.model_selection import KFold
from sklearn.model_selection import GroupKFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.metrics import root_mean_squared_error as RMSE
from sklearn.metrics import mean_absolute_error as MAE
import shap
import time
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy import stats
!pip install optuna
import optuna
from sklearn.model_selection import cross_val_score



In [628]:
drive.mount ('/content/drive', force_remount=True)

Mounted at /content/drive


In [629]:
df = pd.read_json('/content/drive/My Drive/Colab Notebooks/data/train.json')

In [630]:
decode = ['low', 'medium', 'high']
encoder = OrdinalEncoder(categories=[decode])
df['interest_level'] = encoder.fit_transform(df[['interest_level']])
df.head()

Unnamed: 0,bathrooms,bedrooms,building_id,created,description,display_address,features,latitude,listing_id,longitude,manager_id,photos,price,street_address,interest_level
4,1.0,1,8579a0b0d54db803821a35a4a615e97a,2016-06-16 05:55:27,Spacious 1 Bedroom 1 Bathroom in Williamsburg!...,145 Borinquen Place,"[Dining Room, Pre-War, Laundry in Building, Di...",40.7108,7170325,-73.9539,a10db4590843d78c784171a107bdacb4,[https://photos.renthop.com/2/7170325_3bb5ac84...,2400,145 Borinquen Place,1.0
6,1.0,2,b8e75fc949a6cd8225b455648a951712,2016-06-01 05:44:33,BRAND NEW GUT RENOVATED TRUE 2 BEDROOMFind you...,East 44th,"[Doorman, Elevator, Laundry in Building, Dishw...",40.7513,7092344,-73.9722,955db33477af4f40004820b4aed804a0,[https://photos.renthop.com/2/7092344_7663c19a...,3800,230 East 44th,0.0
9,1.0,2,cd759a988b8f23924b5a2058d5ab2b49,2016-06-14 15:19:59,**FLEX 2 BEDROOM WITH FULL PRESSURIZED WALL**L...,East 56th Street,"[Doorman, Elevator, Laundry in Building, Laund...",40.7575,7158677,-73.9625,c8b10a317b766204f08e613cef4ce7a0,[https://photos.renthop.com/2/7158677_c897a134...,3495,405 East 56th Street,1.0
10,1.5,3,53a5b119ba8f7b61d4e010512e0dfc85,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,[],40.7145,7211212,-73.9425,5ba989232d0489da1b5f2c45f6688adc,[https://photos.renthop.com/2/7211212_1ed4542e...,3000,792 Metropolitan Avenue,1.0
15,1.0,0,bfb9405149bfff42a92980b594c28234,2016-06-28 03:50:23,Over-sized Studio w abundant closets. Availabl...,East 34th Street,"[Doorman, Elevator, Fitness Center, Laundry in...",40.7439,7225292,-73.9743,2c3b41f588fbb5234d8a1e885a436cfa,[https://photos.renthop.com/2/7225292_901f1984...,2795,340 East 34th Street,0.0


In [631]:
df['features'] = df['features'].str.join(',')
df['features'] = df['features'].replace(['"', "'", ' '], "", regex=True)

In [632]:
data = df[['price', 'bathrooms', 'bedrooms', 'interest_level']].copy()

features = ['Elevator', 'HardwoodFloors', 'CatsAllowed', 'DogsAllowed', 'Doorman',
            'Dishwasher', 'NoFee', 'LaundryinBuilding', 'FitnessCenter', 'Pre-War',
            'LaundryinUnit', 'RoofDeck', 'OutdoorSpace', 'DiningRoom', 'HighSpeedInternet',
            'Balcony', 'SwimmingPool', 'LaundryInBuilding', 'NewConstruction', 'Terrace']
for feat in features:
  if feat != '':
    data[feat] = df['features'].str.contains(feat).astype(np.int8)

y = data['price']
X = data.drop(columns='price')

In [633]:
X.head()

Unnamed: 0,bathrooms,bedrooms,interest_level,Elevator,HardwoodFloors,CatsAllowed,DogsAllowed,Doorman,Dishwasher,NoFee,...,LaundryinUnit,RoofDeck,OutdoorSpace,DiningRoom,HighSpeedInternet,Balcony,SwimmingPool,LaundryInBuilding,NewConstruction,Terrace
4,1.0,1,1.0,0,1,1,1,0,1,0,...,0,0,0,1,0,0,0,0,0,0
6,1.0,2,0.0,1,1,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
9,1.0,2,1.0,1,1,0,0,1,1,0,...,1,0,0,0,0,0,0,0,0,0
10,1.5,3,1.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
15,1.0,0,0.0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


## **3. Implementation of splitting data methods**

##### **1. Split data into 2 parts randomly with parameter test_size (ratio from 0 to 1), return training and test samples.**

In [634]:
def MySplitTrainTest(*arrays, test_size=0.25, shuffle=True, random_state=None):
  if len(arrays) < 1:
    raise ValueError("Array not correct")
  if test_size <= 0 or test_size >= 1:
    raise ValueError("test_size not correct")

  n_samples = arrays[0].shape[0]
  for arr in arrays:
    if arr.shape[0] != n_samples:
      raise ValueError("Array not correct")

  n_test = int(n_samples * test_size)
  indices = arrays[0].index.tolist()

  if shuffle:
    if random_state:
      np.random.seed(random_state)
    np.random.shuffle(indices)

  test_indices = indices[:n_test]
  train_indices = indices[n_test:]

  sets = []
  for arr in arrays:
    sets.append(arr.loc[train_indices])
    sets.append(arr.loc[test_indices])

  return *sets,


In [635]:
x_train_my, x_test_my, y_train_my, y_test_my = MySplitTrainTest(X, y, test_size=0.25, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [636]:
x_train_my.head()

Unnamed: 0,bathrooms,bedrooms,interest_level,Elevator,HardwoodFloors,CatsAllowed,DogsAllowed,Doorman,Dishwasher,NoFee,...,LaundryinUnit,RoofDeck,OutdoorSpace,DiningRoom,HighSpeedInternet,Balcony,SwimmingPool,LaundryInBuilding,NewConstruction,Terrace
69750,1.0,3,1.0,0,1,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
94734,1.0,2,0.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
87560,1.0,1,0.0,0,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
78008,1.0,1,1.0,1,1,0,0,1,1,1,...,1,0,1,1,0,1,0,0,0,1
32748,1.0,0,0.0,1,1,0,0,0,1,0,...,1,0,0,0,0,0,0,0,0,0


In [637]:
x_train.head()

Unnamed: 0,bathrooms,bedrooms,interest_level,Elevator,HardwoodFloors,CatsAllowed,DogsAllowed,Doorman,Dishwasher,NoFee,...,LaundryinUnit,RoofDeck,OutdoorSpace,DiningRoom,HighSpeedInternet,Balcony,SwimmingPool,LaundryInBuilding,NewConstruction,Terrace
69750,1.0,3,1.0,0,1,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
94734,1.0,2,0.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
87560,1.0,1,0.0,0,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
78008,1.0,1,1.0,1,1,0,0,1,1,1,...,1,0,1,1,0,1,0,0,0,1
32748,1.0,0,0.0,1,1,0,0,0,1,0,...,1,0,0,0,0,0,0,0,0,0


In [638]:
y_train_my.head()

Unnamed: 0,price
69750,3700
94734,1575
87560,3250
78008,4300
32748,2482


In [639]:
y_train.head()

Unnamed: 0,price
69750,3700
94734,1575
87560,3250
78008,4300
32748,2482


##### **2. Randomly split data into 3 parts with parameters validation_size and test_size, return train, validation and test samples.**

In [640]:
def SplitTrainValidTest(*arrays, validation_size=0.25, test_size=0.25, shuffle=True, random_state=None):
  if len(arrays) < 1:
    raise ValueError("Array not correct")
  if test_size <= 0 or test_size >= 1:
    raise ValueError("test_size not correct")
  if validation_size <= 0 or validation_size >= 1:
    raise ValueError("validation_size not correct")

  n_samples = arrays[0].shape[0]
  for arr in arrays:
    if arr.shape[0] != n_samples:
      raise ValueError("Array not correct")

  n_test = int(n_samples * test_size)
  n_valid = n_test + int(n_samples * validation_size)
  indices = arrays[0].index.tolist()

  if shuffle:
    if random_state:
      np.random.seed(random_state)
    np.random.shuffle(indices)

  test_indices = indices[:n_test]
  valid_indices = indices[n_test:n_valid]
  train_indices = indices[n_valid:]

  sets = []
  for arr in arrays:
    sets.append(arr.loc[train_indices])
    sets.append(arr.loc[valid_indices])
    sets.append(arr.loc[test_indices])

  return *sets,


In [641]:
data_train, data_valid, data_test = SplitTrainValidTest(X, random_state=42)

##### **3. Split data into 2 parts with parameter date_split, return train and test samples split by date_split param.**

In [642]:
X['created'] = df['created'].copy()

In [643]:
def SplitByDateTrainTest(data, date_split, date_column):
  if date_column not in data:
    raise ValueError("date_column not correct")

  data[date_column] = pd.to_datetime(data[date_column])

  train = data[data[date_column] < date_split]
  test = data[data[date_column] >= date_split]

  return train, test

In [644]:
train_date, test_date = SplitByDateTrainTest(X, '2016-06-01 05:44:33', 'created')

In [645]:
train_date.sort_values('created').tail(1)

Unnamed: 0,bathrooms,bedrooms,interest_level,Elevator,HardwoodFloors,CatsAllowed,DogsAllowed,Doorman,Dishwasher,NoFee,...,RoofDeck,OutdoorSpace,DiningRoom,HighSpeedInternet,Balcony,SwimmingPool,LaundryInBuilding,NewConstruction,Terrace,created
9476,3.0,4,0.0,1,1,1,1,1,1,0,...,0,0,1,0,0,0,0,0,0,2016-06-01 05:44:06


In [646]:
test_date.sort_values('created').head(1)

Unnamed: 0,bathrooms,bedrooms,interest_level,Elevator,HardwoodFloors,CatsAllowed,DogsAllowed,Doorman,Dishwasher,NoFee,...,RoofDeck,OutdoorSpace,DiningRoom,HighSpeedInternet,Balcony,SwimmingPool,LaundryInBuilding,NewConstruction,Terrace,created
6,1.0,2,0.0,1,1,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,2016-06-01 05:44:33


##### **4. Split data into 3 parts with parameters validation_date and test_date, return train, validation and test samples split by input params.**

In [647]:
def SplitByDateTrainValidTest(data, validation_date, test_date, date_column):
  if date_column not in data:
    raise ValueError("date_column not correct")

  data[date_column] = pd.to_datetime(data[date_column])

  train = data[data[date_column] < validation_date]
  valid = data[(data[date_column] >= validation_date) & (data[date_column] < test_date)]
  test = data[data[date_column] >= test_date]

  return train, valid, test

In [648]:
train_date2, valid_date2, test_date2 = SplitByDateTrainValidTest(X, '2016-05-01 20:30:00', '2016-06-01 05:44:33', 'created')

In [649]:
train_date2.sort_values('created').tail(1)

Unnamed: 0,bathrooms,bedrooms,interest_level,Elevator,HardwoodFloors,CatsAllowed,DogsAllowed,Doorman,Dishwasher,NoFee,...,RoofDeck,OutdoorSpace,DiningRoom,HighSpeedInternet,Balcony,SwimmingPool,LaundryInBuilding,NewConstruction,Terrace,created
93890,1.0,1,1.0,1,0,1,1,1,0,0,...,0,0,0,0,0,0,1,0,0,2016-04-30 19:21:03


In [650]:
valid_date2.sort_values('created').head(1)

Unnamed: 0,bathrooms,bedrooms,interest_level,Elevator,HardwoodFloors,CatsAllowed,DogsAllowed,Doorman,Dishwasher,NoFee,...,RoofDeck,OutdoorSpace,DiningRoom,HighSpeedInternet,Balcony,SwimmingPool,LaundryInBuilding,NewConstruction,Terrace,created
67212,1.0,2,0.0,0,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,2016-05-01 22:36:52


In [651]:
valid_date2.sort_values('created').tail(1)

Unnamed: 0,bathrooms,bedrooms,interest_level,Elevator,HardwoodFloors,CatsAllowed,DogsAllowed,Doorman,Dishwasher,NoFee,...,RoofDeck,OutdoorSpace,DiningRoom,HighSpeedInternet,Balcony,SwimmingPool,LaundryInBuilding,NewConstruction,Terrace,created
9476,3.0,4,0.0,1,1,1,1,1,1,0,...,0,0,1,0,0,0,0,0,0,2016-06-01 05:44:06


In [652]:
test_date2.sort_values('created').head(1)

Unnamed: 0,bathrooms,bedrooms,interest_level,Elevator,HardwoodFloors,CatsAllowed,DogsAllowed,Doorman,Dishwasher,NoFee,...,RoofDeck,OutdoorSpace,DiningRoom,HighSpeedInternet,Balcony,SwimmingPool,LaundryInBuilding,NewConstruction,Terrace,created
6,1.0,2,0.0,1,1,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,2016-06-01 05:44:33


## **4. Implement the next cross-validation methods and 5. Cross-validation comparison**

##### **1. K-Fold, where k is the input parameter, returns a list of train and test indices.**

In [653]:
class K_Fold:
  def __init__(self, k=5, shuffle=False, random_state=None):
    if not isinstance(k, numbers.Integral):
      raise ValueError("The number of folds must be of Int")
    if k <= 1:
      raise ValueError('k-fold cross-validation requires k >=2')

    self.k = k
    self.shuffle = shuffle
    self.random_state = random_state


  def split(self, data):
    n_samples = data.shape[0]
    if self.k > n_samples:
      raise ValueError("Cannot have number of splits greater than n_samples")

    residue = n_samples % self.k
    indices = list(range(n_samples))

    if self.shuffle:
      if self.random_state:
        np.random.seed(self.random_state)
      np.random.shuffle(indices)

    start_ind = 0
    for i in range(self.k):
      fold_size = n_samples // self.k
      if i < residue:
        fold_size +=1

      end_ind = start_ind + fold_size
      test_index = indices[start_ind:end_ind]
      train_index = indices[:start_ind] + indices[end_ind:]
      start_ind += fold_size

      yield np.sort(np.array(train_index)), np.sort(np.array(test_index))

In [654]:
kf = KFold(n_splits=3)

for train_index, test_index in kf.split(X):
    X_train_kfold, X_test_kfold = X.iloc[train_index], X.iloc[test_index]
    print('train:', train_index)
    print('test:', test_index)
    print()

train: [16451 16452 16453 ... 49349 49350 49351]
test: [    0     1     2 ... 16448 16449 16450]

train: [    0     1     2 ... 49349 49350 49351]
test: [16451 16452 16453 ... 32899 32900 32901]

train: [    0     1     2 ... 32899 32900 32901]
test: [32902 32903 32904 ... 49349 49350 49351]



In [655]:
kf_my = K_Fold(k=3)

for train_index, test_index in kf_my.split(X):
    X_train_kfold_my, X_test1_kfold_my = X.iloc[train_index], X.iloc[test_index]
    print('train:', train_index)
    print('test:', test_index)
    print()

train: [16451 16452 16453 ... 49349 49350 49351]
test: [    0     1     2 ... 16448 16449 16450]

train: [    0     1     2 ... 49349 49350 49351]
test: [16451 16452 16453 ... 32899 32900 32901]

train: [    0     1     2 ... 32899 32900 32901]
test: [32902 32903 32904 ... 49349 49350 49351]



##### **2. Grouped K-Fold, where k and group_field are input parameters, returns list of train and test indices.**

In [656]:
class Group_K_Fold:
  def __init__(self, k=5, shuffle=False, random_state=None):
    if not isinstance(k, numbers.Integral):
      raise ValueError("The number of folds must be of Int")
    if k <= 1:
      raise ValueError('k-fold cross-validation requires k >=2')

    self.k = k
    self.shuffle = shuffle
    self.random_state = random_state


  def split(self, X=None, y=None, group_field=None):
    n_samples = X.shape[0]
    if y and y.shape[0] != n_samples:
      raise ValueError("n_samples y doesn't match")

    groups = np.unique(group_field)
    if self.k > len(groups):
      raise ValueError("Cannot have number of splits greater than the number of groups")

    if self.shuffle:
      if self.random_state:
        np.random.seed(self.random_state)
      np.random.shuffle(groups)

    folds = np.array_split(groups, self.k)

    for fold in folds:
      test = np.isin(group_field, fold)
      train = np.isin(group_field, fold, invert=True)

      yield np.sort(np.where(train)[0]), np.sort(np.where(test)[0])

In [657]:
gkf = GroupKFold(n_splits=3, shuffle=True, random_state=42)

for train_index, test_index in gkf.split(X, groups=X['interest_level']):
    X_train_kfold, X_test_kfold = X.iloc[train_index], X.iloc[test_index]
    print(train_index)

[    0     2     3 ... 49349 49350 49351]
[    1     4     5 ... 49346 49347 49351]
[    0     1     2 ... 49348 49349 49350]


In [658]:
gkf_my = Group_K_Fold(k=3, shuffle=True, random_state=42)

for train_index, test_index in gkf_my.split(X, group_field=X['interest_level']):
    X_train_kfold, X_test_kfold = X.iloc[train_index], X.iloc[test_index]
    print(train_index)

[    0     2     3 ... 49349 49350 49351]
[    1     4     5 ... 49346 49347 49351]
[    0     1     2 ... 49348 49349 49350]


##### **3. Stratified K-fold, where k and stratify_field are input parameters, returns list of train and test indices.**

In [659]:
class Stratified_K_Fold:
  def __init__(self, k=5, shuffle=False, random_state=None):
    if not isinstance(k, numbers.Integral):
      raise ValueError("The number of folds must be of Int")
    if k <= 1:
      raise ValueError('k-fold cross-validation requires k >=2')

    self.k = k
    self.shuffle = shuffle
    self.random_state = random_state


  def split(self, stratify_field=None):
    n_samples = len(stratify_field)
    if self.k > n_samples:
      raise ValueError("Cannot have number of splits greater than the number of samples")

    groups, group_count = np.unique(stratify_field, return_counts=True)
    if any(self.k > count for count in group_count):
      raise ValueError("Cannot have number of splits greater than the number of samples in each group")

    if self.shuffle and self.random_state:
      np.random.seed(self.random_state)

    folds_test = [[] for i in range(self.k)]
    for group in groups:
      group_indices = np.where(stratify_field == group)[0]
      if self.shuffle:
        np.random.shuffle(group_indices)
      residue = len(group_indices) % self.k

      start_ind = 0
      for i in range(self.k):
        fold_size = len(group_indices) // self.k
        if i < residue:
          fold_size +=1
        end_ind = start_ind + fold_size
        folds_test[i].extend(group_indices[start_ind:end_ind])
        start_ind += fold_size

    for test_index in folds_test:
      train_index = np.where(~np.isin(np.arange(n_samples), test_index))[0]
      yield np.sort(train_index), np.sort(np.array(test_index))

In [660]:
skf = StratifiedKFold(n_splits=3)

for train_index, test_index in skf.split(X, y=X['Elevator']):
    X_train_kfold, X_test_kfold = X.iloc[train_index], X.iloc[test_index]
    print('train:', train_index)
    print('test:', test_index)
    print()

train: [16082 16083 16087 ... 49349 49350 49351]
test: [    0     1     2 ... 16805 16806 16807]

train: [    0     1     2 ... 49349 49350 49351]
test: [16082 16083 16087 ... 32934 32935 32937]

train: [    0     1     2 ... 32934 32935 32937]
test: [32847 32849 32850 ... 49349 49350 49351]



In [661]:
skf_my = Stratified_K_Fold(k=3)

for train_index, test_index in skf_my.split(stratify_field=X['Elevator']):
    X_train_kfold, X_test_kfold = X.iloc[train_index], X.iloc[test_index]
    print('train:', train_index)
    print('test:', test_index)
    print()

train: [16082 16083 16087 ... 49349 49350 49351]
test: [    0     1     2 ... 16805 16806 16807]

train: [    0     1     2 ... 49349 49350 49351]
test: [16082 16083 16087 ... 32934 32935 32937]

train: [    0     1     2 ... 32934 32935 32937]
test: [32847 32849 32850 ... 49349 49350 49351]



##### **4. Time series split, where k and date_field are input parameters, returns list of train and test indices.**

In [662]:
class Time_Series_Split:
  def __init__(self, k=5, shuffle=False, random_state=None):
    if not isinstance(k, numbers.Integral):
      raise ValueError("The number of folds must be of Int")
    if k <= 1:
      raise ValueError('k-fold cross-validation requires k >=2')

    self.k = k
    self.shuffle = shuffle
    self.random_state = random_state


  def split(self, date_field):
    n_samples = len(date_field)
    if self.k > n_samples:
      raise ValueError("Cannot have number of splits greater than the number of samples")

    indices = list(range(n_samples))
    test_size = n_samples // (self.k + 1)

    for i in range(self.k):
      train_size = i * n_samples // (self.k + 1) + n_samples % (self.k + 1)
      train_index = indices[:train_size]
      test_index = indices[train_size : train_size + test_size * (i + 1)]

      yield np.sort(np.array(train_index)), np.sort(np.array(test_index))

In [663]:
tkf = TimeSeriesSplit(n_splits=3)

for train_index, test_index in tkf.split(X, y=X['Elevator']):
    X_train_kfold, X_test_kfold = X.iloc[train_index], X.iloc[test_index]
    print('train:', train_index)
    print('test:', test_index)
    print()

train: [    0     1     2 ... 12335 12336 12337]
test: [12338 12339 12340 ... 24673 24674 24675]

train: [    0     1     2 ... 24673 24674 24675]
test: [24676 24677 24678 ... 37011 37012 37013]

train: [    0     1     2 ... 37011 37012 37013]
test: [37014 37015 37016 ... 49349 49350 49351]



In [664]:
tkf_my = Time_Series_Split(k=3)

for train_index, test_index in tkf.split(X['Elevator']):
    X_train_kfold, X_test_kfold = X.iloc[train_index], X.iloc[test_index]
    print('train:', train_index)
    print('test:', test_index)
    print()

train: [    0     1     2 ... 12335 12336 12337]
test: [12338 12339 12340 ... 24673 24674 24675]

train: [    0     1     2 ... 24673 24674 24675]
test: [24676 24677 24678 ... 37011 37012 37013]

train: [    0     1     2 ... 37011 37012 37013]
test: [37014 37015 37016 ... 49349 49350 49351]



## **6. Feature Selection**

##### **1. Fit a Lasso regression model with normalized features. Use your method for splitting samples into 3 parts by field created with 60/20/20 ratio — train/validation/test.**

In [665]:
X = X.drop(columns=['created'])

In [666]:
train, valid, test = SplitTrainValidTest(data, validation_size=0.2, test_size=0.20, shuffle=True, random_state=42)

y_train = train['price'].copy()
X_train = train.drop(columns=['price'])

y_valid = valid['price'].copy()
X_valid = valid.drop(columns=['price'])

y_test = test['price'].copy()
X_test = test.drop(columns=['price'])

In [667]:
MinMax = MinMaxScaler()

X_train_norm = pd.DataFrame(MinMax.fit_transform(X_train), columns=X_train.columns)
X_valid_norm = pd.DataFrame(MinMax.transform(X_valid), columns=X_valid.columns)
X_test_norm = pd.DataFrame(MinMax.transform(X_test), columns=X_test.columns)

In [668]:
start_time_lasso = time.time()

lasso = Lasso()
lasso.fit(X_train_norm, y_train)

end_time_lasso = time.time()
lasso_time = end_time_lasso - start_time_lasso

y_pred_train = lasso.predict(X_train_norm)
y_pred_valid = lasso.predict(X_valid_norm)
y_pred_test = lasso.predict(X_test_norm)

In [669]:
result_MAE = pd.DataFrame(columns=['model', 'train', 'valid', 'test'])
result_RMSE = pd.DataFrame(columns=['model', 'train', 'valid', 'test'])
result_r2_score = pd.DataFrame(columns=['model', 'train', 'valid', 'test'])
work_time = pd.DataFrame(columns=['method', 'time'])

In [670]:
result_MAE.loc[len(result_MAE)] = ['Lasso Normalized', MAE(y_train, y_pred_train), MAE(y_valid, y_pred_valid), MAE(y_test, y_pred_test)]
result_RMSE.loc[len(result_RMSE)] = ['Lasso Normalized', RMSE(y_train, y_pred_train), RMSE(y_valid, y_pred_valid), RMSE(y_test, y_pred_test)]
result_r2_score.loc[len(result_r2_score)] = ['Lasso Normalized', r2_score(y_train, y_pred_train), r2_score(y_valid, y_pred_valid), r2_score(y_test, y_pred_test)]

##### **2. Sort features by weight coefficients from model, fit model to top 10 features and compare quality.**

In [671]:
start_time_coef = time.time()

w_coef = lasso.coef_
feature_coef = pd.DataFrame({'features': X_train_norm.columns, 'w_coef_abs': np.abs(w_coef)})
top10_features = feature_coef.sort_values(by=['w_coef_abs'], ascending=False)[:10]

end_time_coef = time.time()
work_time.loc[len(work_time)] = ['Coefficients', lasso_time + end_time_coef - start_time_coef]

In [672]:
feature_coef.sort_values(by=['w_coef_abs'], ascending=False)

Unnamed: 0,features,w_coef_abs
0,bathrooms,22697.210392
1,bedrooms,3759.249418
7,Doorman,1084.970093
2,interest_level,976.963571
13,LaundryinUnit,543.401407
10,LaundryinBuilding,329.406192
20,LaundryInBuilding,289.205521
22,Terrace,281.385751
17,HighSpeedInternet,268.535118
6,DogsAllowed,254.227068


In [673]:
top10_features

Unnamed: 0,features,w_coef_abs
0,bathrooms,22697.210392
1,bedrooms,3759.249418
7,Doorman,1084.970093
2,interest_level,976.963571
13,LaundryinUnit,543.401407
10,LaundryinBuilding,329.406192
20,LaundryInBuilding,289.205521
22,Terrace,281.385751
17,HighSpeedInternet,268.535118
6,DogsAllowed,254.227068


In [674]:
lasso_top10 = Lasso()
lasso_top10.fit(X_train_norm[top10_features['features']], y_train)
y_pred_train_top10 = lasso_top10.predict(X_train_norm[top10_features['features']])
y_pred_valid_top10 = lasso_top10.predict(X_valid_norm[top10_features['features']])
y_pred_test_top10 = lasso_top10.predict(X_test_norm[top10_features['features']])

In [675]:
result_MAE.loc[len(result_MAE)] = ['Lasso Normalized Top10', MAE(y_train, y_pred_train_top10), MAE(y_valid, y_pred_valid_top10), MAE(y_test, y_pred_test_top10)]
result_RMSE.loc[len(result_RMSE)] = ['Lasso Normalized Top10', RMSE(y_train, y_pred_train_top10), RMSE(y_valid, y_pred_valid_top10), RMSE(y_test, y_pred_test_top10)]
result_r2_score.loc[len(result_r2_score)] = ['Lasso Normalized Top10', r2_score(y_train, y_pred_train_top10), r2_score(y_valid, y_pred_valid_top10), r2_score(y_test, y_pred_test_top10)]

In [676]:
result_MAE

Unnamed: 0,model,train,valid,test
0,Lasso Normalized,938.558925,1020.364615,1330.574766
1,Lasso Normalized Top10,933.175407,1015.430953,1325.673018


In [677]:
result_RMSE

Unnamed: 0,model,train,valid,test
0,Lasso Normalized,9232.721097,11014.932442,45187.584469
1,Lasso Normalized Top10,9234.159163,11016.789936,45190.028202


In [678]:
result_r2_score

Unnamed: 0,model,train,valid,test
0,Lasso Normalized,0.034186,0.02337,0.001813
1,Lasso Normalized Top10,0.033885,0.02304,0.001705


##### **3. Implement method for simple feature selection by nan-ratio in feature and correlation. Apply this method to feature set and take top 10 features, refit model and measure quality.**

In [679]:
def SelectByCorrelation(X, y, n_top_features=10):
  data = X.copy()
  data['target'] = y.copy()
  correlation = data.corr()['target'].abs()
  return correlation.sort_values(ascending=False).head(n_top_features + 1).index[1:].tolist()

In [680]:
start_time_corr = time.time()

corr_top10 = SelectByCorrelation(X_train_norm, y_train, 10)

end_time_corr = time.time()
work_time.loc[len(work_time)] = ['Correlation', end_time_corr - start_time_corr]

In [681]:
lasso_corr = Lasso()
lasso_corr.fit(X_train_norm[corr_top10], y_train)

y_pred_train_corr = lasso_corr.predict(X_train_norm[corr_top10])
y_pred_valid_corr = lasso_corr.predict(X_valid_norm[corr_top10])
y_pred_test_corr = lasso_corr.predict(X_test_norm[corr_top10])

In [682]:
result_MAE.loc[len(result_MAE)] = ['Lasso Normalized Correlation', MAE(y_train, y_pred_train_corr), MAE(y_valid, y_pred_valid_corr), MAE(y_test, y_pred_test_corr)]
result_RMSE.loc[len(result_RMSE)] = ['Lasso Normalized Correlation', RMSE(y_train, y_pred_train_corr), RMSE(y_valid, y_pred_valid_corr), RMSE(y_test, y_pred_test_corr)]
result_r2_score.loc[len(result_r2_score)] = ['Lasso Normalized Correlation', r2_score(y_train, y_pred_train_corr), r2_score(y_valid, y_pred_valid_corr), r2_score(y_test, y_pred_test_corr)]

In [683]:
result_MAE

Unnamed: 0,model,train,valid,test
0,Lasso Normalized,938.558925,1020.364615,1330.574766
1,Lasso Normalized Top10,933.175407,1015.430953,1325.673018
2,Lasso Normalized Correlation,1310.253559,1382.302796,1711.184254


In [684]:
result_RMSE

Unnamed: 0,model,train,valid,test
0,Lasso Normalized,9232.721097,11014.932442,45187.584469
1,Lasso Normalized Top10,9234.159163,11016.789936,45190.028202
2,Lasso Normalized Correlation,9349.917358,11107.346022,45213.19538


In [685]:
result_r2_score

Unnamed: 0,model,train,valid,test
0,Lasso Normalized,0.034186,0.02337,0.001813
1,Lasso Normalized Top10,0.033885,0.02304,0.001705
2,Lasso Normalized Correlation,0.009511,0.006913,0.000681


##### **4. Implement permutation importance method and take top 10 features, refit model and measure qualit**

In [686]:
def Permutation_Impotance(estimator, X, y, n_repeats=5, random_state=None):
  if random_state:
    np.random.seed(random_state)

  start_score = MAE(y, estimator.predict(X))
  importance_mean = []

  for col in X.columns:
    original_values = X[col].copy()
    score_diff = []

    for i in range(n_repeats):
      X[col] = np.random.permutation(X[col])
      new_score = MAE(y, estimator.predict(X))
      score_diff.append(abs(start_score - new_score))

    X[col] = original_values
    importance_mean.append(np.mean(score_diff))

  return pd.DataFrame({'features': X.columns, 'mean_impotances': importance_mean})

In [687]:
start_time_perm = time.time()

permutation_imp = Permutation_Impotance(lasso, X_train_norm, y_train, random_state=42)
top10_perm_imp = permutation_imp.sort_values(by=['mean_impotances'], ascending=False)[:10]['features']

end_time_perm = time.time()
work_time.loc[len(work_time)] = ['Permutation Impotance', lasso_time + end_time_perm - start_time_perm]

In [688]:
permutation_imp.sort_values(by=['mean_impotances'], ascending=False)

Unnamed: 0,features,mean_impotances
0,bathrooms,410.978207
1,bedrooms,180.642789
7,Doorman,175.696191
2,interest_level,52.323497
13,LaundryinUnit,30.778635
10,LaundryinBuilding,13.337313
6,DogsAllowed,4.391601
17,HighSpeedInternet,4.097585
3,Elevator,3.1421
5,CatsAllowed,2.54589


In [689]:
lasso_perm = Lasso()
lasso_perm.fit(X_train_norm[top10_perm_imp], y_train)

y_pred_train_perm = lasso_perm.predict(X_train_norm[top10_perm_imp])
y_pred_valid_perm = lasso_perm.predict(X_valid_norm[top10_perm_imp])
y_pred_test_perm = lasso_perm.predict(X_test_norm[top10_perm_imp])

In [690]:
result_MAE.loc[len(result_MAE)] = ['Lasso Permutation', MAE(y_train, y_pred_train_perm), MAE(y_valid, y_pred_valid_perm), MAE(y_test, y_pred_test_perm)]
result_RMSE.loc[len(result_RMSE)] = ['Lasso Permutation', RMSE(y_train, y_pred_train_perm), RMSE(y_valid, y_pred_valid_perm), RMSE(y_test, y_pred_test_perm)]
result_r2_score.loc[len(result_r2_score)] = ['Lasso Permutation', r2_score(y_train, y_pred_train_perm), r2_score(y_valid, y_pred_valid_perm), r2_score(y_test, y_pred_test_perm)]

In [691]:
result_MAE

Unnamed: 0,model,train,valid,test
0,Lasso Normalized,938.558925,1020.364615,1330.574766
1,Lasso Normalized Top10,933.175407,1015.430953,1325.673018
2,Lasso Normalized Correlation,1310.253559,1382.302796,1711.184254
3,Lasso Permutation,930.237466,1013.966195,1322.545868


In [692]:
result_RMSE

Unnamed: 0,model,train,valid,test
0,Lasso Normalized,9232.721097,11014.932442,45187.584469
1,Lasso Normalized Top10,9234.159163,11016.789936,45190.028202
2,Lasso Normalized Correlation,9349.917358,11107.346022,45213.19538
3,Lasso Permutation,9234.37841,11016.764484,45190.027055


In [693]:
result_r2_score

Unnamed: 0,model,train,valid,test
0,Lasso Normalized,0.034186,0.02337,0.001813
1,Lasso Normalized Top10,0.033885,0.02304,0.001705
2,Lasso Normalized Correlation,0.009511,0.006913,0.000681
3,Lasso Permutation,0.033839,0.023045,0.001705


##### **5. Import Shap and also refit model on top 10 features.**

In [694]:
start_time_shap = time.time()

explainer = shap.Explainer(lasso, X_train_norm)
mean_shap_values = np.abs(np.mean(explainer.shap_values(X_valid_norm), axis=0))
shap_faetures = pd.DataFrame({'features': X_train_norm.columns, 'mean_shap_values': mean_shap_values})
top10_shap = shap_faetures.sort_values(by=['mean_shap_values'], ascending=False)[:10]['features']

end_time_shap = time.time()
work_time.loc[len(work_time)] = ['Shap', lasso_time + end_time_shap - start_time_shap]

shap_faetures.sort_values(by=['mean_shap_values'], ascending=False)

Unnamed: 0,features,mean_shap_values
0,bathrooms,159.041446
1,bedrooms,54.979499
7,Doorman,36.132692
13,LaundryinUnit,22.5674
6,DogsAllowed,16.392108
9,NoFee,11.566031
2,interest_level,11.20984
4,HardwoodFloors,7.733132
10,LaundryinBuilding,5.683675
15,OutdoorSpace,4.185366


In [695]:
lasso_shap = Lasso()
lasso_shap.fit(X_train_norm[top10_shap], y_train)

y_pred_train_shap = lasso_shap.predict(X_train_norm[top10_shap])
y_pred_valid_shap = lasso_shap.predict(X_valid_norm[top10_shap])
y_pred_test_shap = lasso_shap.predict(X_test_norm[top10_shap])

In [696]:
result_MAE.loc[len(result_MAE)] = ['Lasso Normalized Shap', MAE(y_train, y_pred_train_shap), MAE(y_valid, y_pred_valid_shap), MAE(y_test, y_pred_test_shap)]
result_RMSE.loc[len(result_RMSE)] = ['Lasso Normalized Shap', RMSE(y_train, y_pred_train_shap), RMSE(y_valid, y_pred_valid_shap), RMSE(y_test, y_pred_test_shap)]
result_r2_score.loc[len(result_r2_score)] = ['Lasso Normalized Shap', r2_score(y_train, y_pred_train_shap), r2_score(y_valid, y_pred_valid_shap), r2_score(y_test, y_pred_test_shap)]

##### **6. Compare the quality of these methods for different aspects — speed, metrics and stability.**

In [697]:
result_MAE

Unnamed: 0,model,train,valid,test
0,Lasso Normalized,938.558925,1020.364615,1330.574766
1,Lasso Normalized Top10,933.175407,1015.430953,1325.673018
2,Lasso Normalized Correlation,1310.253559,1382.302796,1711.184254
3,Lasso Permutation,930.237466,1013.966195,1322.545868
4,Lasso Normalized Shap,938.440281,1019.911953,1330.802027


In [698]:
result_RMSE

Unnamed: 0,model,train,valid,test
0,Lasso Normalized,9232.721097,11014.932442,45187.584469
1,Lasso Normalized Top10,9234.159163,11016.789936,45190.028202
2,Lasso Normalized Correlation,9349.917358,11107.346022,45213.19538
3,Lasso Permutation,9234.37841,11016.764484,45190.027055
4,Lasso Normalized Shap,9234.206211,11016.246803,45189.678048


In [699]:
result_r2_score

Unnamed: 0,model,train,valid,test
0,Lasso Normalized,0.034186,0.02337,0.001813
1,Lasso Normalized Top10,0.033885,0.02304,0.001705
2,Lasso Normalized Correlation,0.009511,0.006913,0.000681
3,Lasso Permutation,0.033839,0.023045,0.001705
4,Lasso Normalized Shap,0.033875,0.023137,0.00172


In [700]:
work_time

Unnamed: 0,method,time
0,Coefficients,0.511262
1,Correlation,0.095163
2,Permutation Impotance,2.920207
3,Shap,0.513484


Наименьшее качество наблюдается у модели, обученной на признаках, выбранных по коррелляции. Значимых отличий в качестве остальных моделей нет.

По времени наиболее эффективная модель, основанная на выборе признаков по коррелляции, коэфициентам w, shap. Permutation Impotance занимает больше всего времени

## **7. Hyperparameter optimization**

##### **1. Implement grid search and random search methods for alpha and l1_ratio for sklearn's ElasticNet model. 2. Find the best combination of model hyperparameters. 3.Fit the resulting model.**

In [701]:
elastic_grid = ElasticNet()

grid_params = {
    'alpha': [0.01, 0.1, 1, 10],
    'l1_ratio': [0.2, 0.5, 0.8, 0.9]
}

grid_search = GridSearchCV(estimator=elastic_grid, param_grid=grid_params)
grid_search.fit(X_train_norm, y_train)
print("Лучшие гиперпараметры GridSearchCV: ", grid_search.best_params_)

Лучшие гиперпараметры GridSearchCV:  {'alpha': 0.01, 'l1_ratio': 0.9}


In [702]:
elastic_rand = ElasticNet()

param_dist = {
    'alpha': stats.uniform(loc=0.01, scale=50),
    'l1_ratio' : stats.uniform(loc=0.01, scale=0.1)
}

random_search = RandomizedSearchCV(estimator=elastic_rand, param_distributions=param_dist, n_iter=30, random_state=42)
random_search.fit(X_train_norm, y_train)
print("Лучшие гиперпараметры RandomizedSearchCV: ", random_search.best_params_)

Лучшие гиперпараметры RandomizedSearchCV:  {'alpha': 1.0392247147901224, 'l1_ratio': 0.10699098521619943}


##### **4. Import optuna and configure the same experiment with ElasticNet**

In [704]:
def objective(trial):
  alpha = trial.suggest_float('alpha', 0.01, 10, log=True)
  l1_ratio = trial.suggest_float('l1_ratio', 0.01, 1)

  elast_optuna = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
  elast_optuna.fit(X_train_norm, y_train)

  y_pred_valid_opt = elast_optuna.predict(X_valid_norm)
  mae = MAE(y_valid, y_pred_valid_opt)

  return mae

study = optuna.create_study()
study.optimize(objective, n_trials=50)
print("Лучшие гиперпараметры optuna", study.best_params)

[I 2025-02-17 14:27:38,804] A new study created in memory with name: no-name-a94e2d4c-206b-48b5-abdb-81a0c438522d
[I 2025-02-17 14:27:38,833] Trial 0 finished with value: 1285.766184524184 and parameters: {'alpha': 0.30597400827215954, 'l1_ratio': 0.5727211195989372}. Best is trial 0 with value: 1285.766184524184.
[I 2025-02-17 14:27:38,874] Trial 1 finished with value: 1344.9331819027054 and parameters: {'alpha': 0.29578322144341, 'l1_ratio': 0.011750674719663384}. Best is trial 0 with value: 1285.766184524184.
[I 2025-02-17 14:27:38,925] Trial 2 finished with value: 1409.5284390725963 and parameters: {'alpha': 2.7855957819408967, 'l1_ratio': 0.6840224938057815}. Best is trial 0 with value: 1285.766184524184.
[I 2025-02-17 14:27:38,989] Trial 3 finished with value: 1386.0803295912015 and parameters: {'alpha': 1.0544382567814174, 'l1_ratio': 0.4589050429694457}. Best is trial 0 with value: 1285.766184524184.
[I 2025-02-17 14:27:39,060] Trial 4 finished with value: 1167.3245237350375 an

Лучшие гиперпараметры optuna {'alpha': 0.06523946290883112, 'l1_ratio': 0.9943468073848019}


##### **5. Estimate metrics and compare approaches.**

In [705]:
y_pred_train_grid = grid_search.best_estimator_.predict(X_train_norm)
y_pred_valid_grid = grid_search.best_estimator_.predict(X_valid_norm)
y_pred_test_grid = grid_search.best_estimator_.predict(X_test_norm)

result_MAE.loc[len(result_MAE)] = ['ElasticNet grid search', MAE(y_train, y_pred_train_grid), MAE(y_valid, y_pred_valid_grid), MAE(y_test, y_pred_test_grid)]
result_RMSE.loc[len(result_RMSE)] = ['ElasticNet grid search', RMSE(y_train, y_pred_train_grid), RMSE(y_valid, y_pred_valid_grid), RMSE(y_test, y_pred_test_grid)]
result_r2_score.loc[len(result_r2_score)] = ['ElasticNet grid search', r2_score(y_train, y_pred_train_grid), r2_score(y_valid, y_pred_valid_grid), r2_score(y_test, y_pred_test_grid)]

In [706]:
y_pred_train_rand = random_search.best_estimator_.predict(X_train_norm)
y_pred_valid_rand = random_search.best_estimator_.predict(X_valid_norm)
y_pred_test_rand = random_search.best_estimator_.predict(X_test_norm)

result_MAE.loc[len(result_MAE)] = ['ElasticNet random search', MAE(y_train, y_pred_train_rand), MAE(y_valid, y_pred_valid_rand), MAE(y_test, y_pred_test_rand)]
result_RMSE.loc[len(result_RMSE)] = ['ElasticNet random search', RMSE(y_train, y_pred_train_rand), RMSE(y_valid, y_pred_valid_rand), RMSE(y_test, y_pred_test_rand)]
result_r2_score.loc[len(result_r2_score)] = ['ElasticNet random search', r2_score(y_train, y_pred_train_rand), r2_score(y_valid, y_pred_valid_rand), r2_score(y_test, y_pred_test_rand)]

In [707]:
elast_opt = ElasticNet(alpha=study.best_params['alpha'], l1_ratio=study.best_params['l1_ratio'])
elast_opt.fit(X_train_norm, y_train)

y_pred_train_opt = elast_opt.predict(X_train_norm)
y_pred_valid_opt = elast_opt.predict(X_valid_norm)
y_pred_test_opt = elast_opt.predict(X_test_norm)

result_MAE.loc[len(result_MAE)] = ['ElasticNet Optuna', MAE(y_train, y_pred_train_opt), MAE(y_valid, y_pred_valid_opt), MAE(y_test, y_pred_test_opt)]
result_RMSE.loc[len(result_RMSE)] = ['ElasticNet Optuna', RMSE(y_train, y_pred_train_opt), RMSE(y_valid, y_pred_valid_opt), RMSE(y_test, y_pred_test_opt)]
result_r2_score.loc[len(result_r2_score)] = ['ElasticNet Optuna', r2_score(y_train, y_pred_train_opt), r2_score(y_valid, y_pred_valid_opt), r2_score(y_test, y_pred_test_opt)]

In [708]:
result_MAE

Unnamed: 0,model,train,valid,test
0,Lasso Normalized,938.558925,1020.364615,1330.574766
1,Lasso Normalized Top10,933.175407,1015.430953,1325.673018
2,Lasso Normalized Correlation,1310.253559,1382.302796,1711.184254
3,Lasso Permutation,930.237466,1013.966195,1322.545868
4,Lasso Normalized Shap,938.440281,1019.911953,1330.802027
5,ElasticNet grid search,946.951828,1026.303907,1339.544644
6,ElasticNet random search,1336.620035,1411.304743,1734.561839
7,ElasticNet Optuna,938.968482,1019.620589,1330.617501


In [709]:
result_RMSE

Unnamed: 0,model,train,valid,test
0,Lasso Normalized,9232.721097,11014.932442,45187.584469
1,Lasso Normalized Top10,9234.159163,11016.789936,45190.028202
2,Lasso Normalized Correlation,9349.917358,11107.346022,45213.19538
3,Lasso Permutation,9234.37841,11016.764484,45190.027055
4,Lasso Normalized Shap,9234.206211,11016.246803,45189.678048
5,ElasticNet grid search,9239.030512,11019.706459,45186.007264
6,ElasticNet random search,9371.07606,11126.280084,45224.134577
7,ElasticNet Optuna,9234.199319,11015.96522,45186.38531


In [710]:
result_r2_score

Unnamed: 0,model,train,valid,test
0,Lasso Normalized,0.034186,0.02337,0.001813
1,Lasso Normalized Top10,0.033885,0.02304,0.001705
2,Lasso Normalized Correlation,0.009511,0.006913,0.000681
3,Lasso Permutation,0.033839,0.023045,0.001705
4,Lasso Normalized Shap,0.033875,0.023137,0.00172
5,ElasticNet grid search,0.032865,0.022523,0.001883
6,ElasticNet random search,0.005023,0.003525,0.000198
7,ElasticNet Optuna,0.033876,0.023186,0.001866


Наиболее качественный подбор гиперпараметров у optuna

##### **6. Run optuna on one of the cross-validation schemes.**

In [711]:
def objective(trial):
  alpha = trial.suggest_float('alpha', 0.01, 10, log=True)
  l1_ratio = trial.suggest_float('l1_ratio', 0.01, 1, log=True)

  elast_optuna = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
  elast_optuna.fit(X_train_norm, y_train)

  y_pred_valid_opt = elast_optuna.predict(X_valid_norm)
  cv = cross_val_score(elast_optuna, X_train_norm, y_train)

  return np.mean(cv)

study_cv = optuna.create_study()
study_cv.optimize(objective, n_trials=50)
print("Лучшие гиперпараметры optuna", study_cv.best_params)

[I 2025-02-17 14:27:45,923] A new study created in memory with name: no-name-8fe41cca-02df-4a1e-a165-bbec9e3dd54c
[I 2025-02-17 14:27:46,610] Trial 0 finished with value: 0.1196549368256619 and parameters: {'alpha': 0.15603728556675925, 'l1_ratio': 0.3241699400342574}. Best is trial 0 with value: 0.1196549368256619.
[I 2025-02-17 14:27:48,063] Trial 1 finished with value: 0.15772982117146783 and parameters: {'alpha': 0.04658024388445479, 'l1_ratio': 0.0337491550404783}. Best is trial 0 with value: 0.1196549368256619.
[I 2025-02-17 14:27:49,673] Trial 2 finished with value: 0.056374666745346344 and parameters: {'alpha': 1.3443443075972672, 'l1_ratio': 0.5557019231115565}. Best is trial 2 with value: 0.056374666745346344.
[I 2025-02-17 14:27:50,533] Trial 3 finished with value: 0.06673563217274654 and parameters: {'alpha': 1.2512309706297429, 'l1_ratio': 0.6599186230922575}. Best is trial 2 with value: 0.056374666745346344.
[I 2025-02-17 14:27:51,111] Trial 4 finished with value: 0.07984

Лучшие гиперпараметры optuna {'alpha': 9.931080613414005, 'l1_ratio': 0.05510787102258826}


In [712]:
elast_opt_cv = ElasticNet(alpha=study_cv.best_params['alpha'], l1_ratio=study_cv.best_params['l1_ratio'])
elast_opt_cv.fit(X_train_norm, y_train)

y_pred_train_optcv = elast_opt_cv.predict(X_train_norm)
y_pred_valid_optcv = elast_opt_cv.predict(X_valid_norm)
y_pred_test_optcv = elast_opt_cv.predict(X_test_norm)

result_MAE.loc[len(result_MAE)] = ['ElasticNet Optuna CV', MAE(y_train, y_pred_train_optcv), MAE(y_valid, y_pred_valid_optcv), MAE(y_test, y_pred_test_optcv)]
result_RMSE.loc[len(result_RMSE)] = ['ElasticNet Optuna CV', RMSE(y_train, y_pred_train_optcv), RMSE(y_valid, y_pred_valid_optcv), RMSE(y_test, y_pred_test_optcv)]
result_r2_score.loc[len(result_r2_score)] = ['ElasticNet Optuna CV', r2_score(y_train, y_pred_train_optcv), r2_score(y_valid, y_pred_valid_optcv), r2_score(y_test, y_pred_test_optcv)]

In [713]:
result_MAE

Unnamed: 0,model,train,valid,test
0,Lasso Normalized,938.558925,1020.364615,1330.574766
1,Lasso Normalized Top10,933.175407,1015.430953,1325.673018
2,Lasso Normalized Correlation,1310.253559,1382.302796,1711.184254
3,Lasso Permutation,930.237466,1013.966195,1322.545868
4,Lasso Normalized Shap,938.440281,1019.911953,1330.802027
5,ElasticNet grid search,946.951828,1026.303907,1339.544644
6,ElasticNet random search,1336.620035,1411.304743,1734.561839
7,ElasticNet Optuna,938.968482,1019.620589,1330.617501
8,ElasticNet Optuna CV,1405.178612,1480.955085,1802.654762


In [714]:
result_RMSE

Unnamed: 0,model,train,valid,test
0,Lasso Normalized,9232.721097,11014.932442,45187.584469
1,Lasso Normalized Top10,9234.159163,11016.789936,45190.028202
2,Lasso Normalized Correlation,9349.917358,11107.346022,45213.19538
3,Lasso Permutation,9234.37841,11016.764484,45190.027055
4,Lasso Normalized Shap,9234.206211,11016.246803,45189.678048
5,ElasticNet grid search,9239.030512,11019.706459,45186.007264
6,ElasticNet random search,9371.07606,11126.280084,45224.134577
7,ElasticNet Optuna,9234.199319,11015.96522,45186.38531
8,ElasticNet Optuna CV,9390.672637,11142.727624,45229.435153


In [715]:
result_r2_score

Unnamed: 0,model,train,valid,test
0,Lasso Normalized,0.034186,0.02337,0.001813
1,Lasso Normalized Top10,0.033885,0.02304,0.001705
2,Lasso Normalized Correlation,0.009511,0.006913,0.000681
3,Lasso Permutation,0.033839,0.023045,0.001705
4,Lasso Normalized Shap,0.033875,0.023137,0.00172
5,ElasticNet grid search,0.032865,0.022523,0.001883
6,ElasticNet random search,0.005023,0.003525,0.000198
7,ElasticNet Optuna,0.033876,0.023186,0.001866
8,ElasticNet Optuna CV,0.000857,0.000576,-3.7e-05
