# Part 1 Boosting (5 points)

In this part we predict the salaries of data scientists from a number of factors using gradient boosting.

The dataset contains the following features:



* work_year: The year in which the salary was paid.

* experience_level: The level of experience, such as Junior, Senior, or Lead.

* employment_type: The type of employment, such as Full-time or Contract.

* job_title: The specific job title or role, such as Data Analyst or Data Scientist.

* salary: The salary amount for the given job.

* salary_currency: The currency in which the salary is denoted.

* salary_in_usd: The equivalent salary amount converted to US dollars (USD) for comparison purposes.

* employee_residence: The country or region where the employee resides.

* remote_ratio: The percentage of remote work offered in the job.

* company_location: The location of the company or organization.

* company_size: The company's size is categorized as Small, Medium, or Large.

In [92]:
import pandas as pd

df = pd.read_csv("homework_8_boosting_clustering_tursunov/ds_salaries.csv")
df.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,SE,FT,Principal Data Scientist,80000,EUR,85847,ES,100,ES,L
1,2023,MI,CT,ML Engineer,30000,USD,30000,US,100,US,S
2,2023,MI,CT,ML Engineer,25500,USD,25500,US,100,US,S
3,2023,SE,FT,Data Scientist,175000,USD,175000,CA,100,CA,M
4,2023,SE,FT,Data Scientist,120000,USD,120000,CA,100,CA,M


## Task 1 (0.5 points) Preparation



*   Split the data into train, val, and test sets (80%, 10%, 10%)
*   Extract salary_in_usd as the target
*   Find and remove the feature that could cause a data leak


In [93]:
from sklearn.model_selection import train_test_split

X = df.drop('salary', axis=1) # Leaky feature: the same salary in local currency, strongly correlated with the target
y = df['salary_in_usd']
X = X.drop('salary_in_usd', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=42)
print(X_train.shape, X_valid.shape, X_test.shape)
print(y_train.shape, y_valid.shape, y_test.shape)

(3004, 9) (375, 9) (376, 9)
(3004,) (375,) (376,)
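
The two-stage split above can be wrapped in a small helper. This is only a sketch: the function name `split_train_val_test` and the toy arrays are assumptions made for illustration and are not part of the original solution.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_train_val_test(X, y, val_size=0.1, test_size=0.1, random_state=42):
    """Split into train/val/test; the sizes are fractions of the full dataset."""
    holdout = val_size + test_size
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=holdout, random_state=random_state
    )
    # Divide the held-out part between validation and test.
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold, y_hold, test_size=test_size / holdout, random_state=random_state
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Toy data (hypothetical): 1000 rows, 2 features.
X_toy = np.arange(2000).reshape(1000, 2)
y_toy = np.arange(1000)
parts = split_train_val_test(X_toy, y_toy)
sizes = [len(p) for p in parts[:3]]
print(sizes)
```

Fixing `random_state` keeps the split reproducible, which matters here because the validation part is later used to choose hyperparameters.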


## Task 2 (0.5 points) Linear model


*   Encode the categorical features with OneHotEncoder
*   Train a linear regression model
*   Evaluate the quality using MAPE and RMSE
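
A minimal sketch of one-hot encoding inside a `ColumnTransformer`, on a hypothetical toy frame. In this sketch the encoder is fitted on the training rows only, and `handle_unknown="ignore"` makes a category that was not seen during fitting encode as all zeros. This is one common choice, not necessarily the one the assignment expects.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical and one numeric column.
train = pd.DataFrame(
    {"company_size": ["S", "M", "L", "M"], "remote_ratio": [0, 50, 100, 100]}
)
new = pd.DataFrame(
    {"company_size": ["M", "XL"], "remote_ratio": [0, 100]}  # "XL" is unseen
)

ct = ColumnTransformer(
    [("ohe", OneHotEncoder(handle_unknown="ignore"), ["company_size"])],
    remainder="passthrough",  # keep the numeric column as is
    sparse_threshold=0,       # always return a dense array
)
ct.fit(train)
encoded = ct.transform(new)
print(encoded)
```

The first three output columns correspond to the sorted categories `L`, `M`, `S`; the last one is the untouched `remote_ratio`.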


In [94]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

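# Positional indices of the categorical columns in X
# (every column except work_year and remote_ratio).
categorical_features = [1, 2, 3, 4, 5, 7, 8]
ct = ColumnTransformer([('one_hot_encoder', OneHotEncoder(categories='auto'), categorical_features)], remainder='passthrough')
# The encoder is fitted on the full feature matrix so that every category is known at transform time.
ct.fit(X)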
X_train = ct.transform(X_train)
X_test = ct.transform(X_test)
X_valid = ct.transform(X_valid)

lr = LinearRegression()
lr.fit(X_train, y_train)

print(f'MAPE on train set: {mean_absolute_percentage_error(y_train, lr.predict(X_train)) * 100}')
print(f'MAPE on valid set: {mean_absolute_percentage_error(y_valid, lr.predict(X_valid)) * 100}')
print(f'MAPE on test set: {mean_absolute_percentage_error(y_test, lr.predict(X_test)) * 100}')

print(f'RMSE on train set: {mean_squared_error(y_train, lr.predict(X_train)) ** 0.5}')
print(f'RMSE on valid set: {mean_squared_error(y_valid, lr.predict(X_valid)) ** 0.5}')
print(f'RMSE on test set: {mean_squared_error(y_test, lr.predict(X_test)) ** 0.5}')

MAPE on train set: 30.381107727174278
MAPE on valid set: 42.840814852677624
MAPE on test set: 36.94781891868101
RMSE on train set: 44792.19637548335
RMSE on valid set: 46795.70116126261
RMSE on test set: 51629.75232224086
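
For reference, the two metrics reported above can be written out by hand. The sketch below uses plain NumPy and made-up numbers; note that scikit-learn's `mean_absolute_percentage_error` also returns a fraction (hence the `* 100` in the cells above) and additionally guards against division by zero.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (multiply by 100 for percent)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


# Made-up salaries in USD, for illustration only.
y_true_toy = [100_000, 50_000, 200_000]
y_pred_toy = [110_000, 40_000, 200_000]
print(f"MAPE: {mape(y_true_toy, y_pred_toy) * 100:.1f}%")
print(f"RMSE: {rmse(y_true_toy, y_pred_toy):.1f}")
```

MAPE weights an error of 10,000 USD more heavily on a low salary than on a high one, while RMSE penalizes large absolute errors, so the two metrics can rank models differently.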


## Task 3 (0.5 points) XGBoost

Let's start with the xgboost library.

Train an `XGBRegressor` model on the same data as the linear model, tuning the hyperparameters (`max_depth, learning_rate, n_estimators, gamma`, etc.) on the validation set. Evaluate the quality of the final model (MAPE, RMSE), as well as its training and prediction speed.

In [95]:
from xgboost.sklearn import XGBRegressor
from sklearn.model_selection import ParameterGrid

params = {
    'max_depth' : range(2, 7, 2),
    'learning_rate' : [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1],
    'n_estimators' : [50, 100, 150, 200],
    'gamma' : [0, 0.2, 0.5, 1],
    'min_child_weight': range(1, 6, 2)
}

grid = list(ParameterGrid(params))
min_rmse = 1e30
min_params = [0] * 5
for par in grid:
    model_xgb = XGBRegressor(max_depth=par['max_depth'],
                            learning_rate=par['learning_rate'],
                            n_estimators=par['n_estimators'],
                            gamma=par['gamma'],
                            min_child_weight=par['min_child_weight'],
                            seed=42)
    model_xgb.fit(X_train, y_train)
    rmse = mean_squared_error(y_valid, model_xgb.predict(X_valid)) ** 0.5
    if min_rmse > rmse:
        min_rmse = rmse
        min_params = [par['max_depth'],
                    par['learning_rate'],
                    par['n_estimators'],
                    par['gamma'],
                    par['min_child_weight']]
min_rmse, min_params

(44799.71446329251, [2, 0.5, 200, 0, 1])
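
The same validation-based search is repeated for every library below, so it can be factored into one helper. The sketch assumes a hypothetical function `tune_on_validation` and uses scikit-learn's `GradientBoostingRegressor` on synthetic data as a stand-in, so that it runs without xgboost; any estimator with the scikit-learn `fit`/`predict` interface could be passed instead.

```python
import time

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ParameterGrid


def tune_on_validation(make_model, param_grid, X_train, y_train, X_valid, y_valid):
    """Return (best_rmse, best_params, fit_seconds), chosen by validation RMSE."""
    best_rmse, best_params, best_fit_time = np.inf, None, None
    for params in ParameterGrid(param_grid):
        model = make_model(**params)
        start = time.perf_counter()
        model.fit(X_train, y_train)
        fit_time = time.perf_counter() - start
        # Predict once and reuse the score.
        rmse = mean_squared_error(y_valid, model.predict(X_valid)) ** 0.5
        if rmse < best_rmse:
            best_rmse, best_params, best_fit_time = rmse, params, fit_time
    return best_rmse, best_params, best_fit_time


# Synthetic data (hypothetical), only to make the sketch runnable.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(300, 4))
y_syn = 3 * X_syn[:, 0] + rng.normal(scale=0.1, size=300)
grid_syn = {"max_depth": [1, 2], "learning_rate": [0.1, 0.5], "n_estimators": [20]}

best_rmse, best_params, fit_seconds = tune_on_validation(
    lambda **p: GradientBoostingRegressor(random_state=0, **p),
    grid_syn, X_syn[:200], y_syn[:200], X_syn[200:], y_syn[200:],
)
print(best_rmse, best_params, fit_seconds)
```

Returning the parameters as a dict rather than a positional list makes the refit step less error-prone, since the model can be rebuilt with `make_model(**best_params)`.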

In [96]:
%%time
model_xgb = XGBRegressor(max_depth=min_params[0],
                            learning_rate=min_params[1],
                            n_estimators=min_params[2],
                            gamma=min_params[3],
                            min_child_weight=min_params[4],
                            seed=42)
model_xgb.fit(X_train, y_train)
print('')


CPU times: total: 469 ms
Wall time: 161 ms


In [97]:
%%timeit
model_xgb.predict(X_test)

The slowest run took 4.15 times longer than the fastest. This could mean that an intermediate result is being cached.
3.36 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [98]:
print(f'MAPE on train set: {mean_absolute_percentage_error(y_train, model_xgb.predict(X_train)) * 100}')
print(f'MAPE on valid set: {mean_absolute_percentage_error(y_valid, model_xgb.predict(X_valid)) * 100}')
print(f'MAPE on test set: {mean_absolute_percentage_error(y_test, model_xgb.predict(X_test)) * 100}')

print(f'RMSE on train set: {mean_squared_error(y_train, model_xgb.predict(X_train)) ** 0.5}')
print(f'RMSE on valid set: {mean_squared_error(y_valid, model_xgb.predict(X_valid)) ** 0.5}')
print(f'RMSE on test set: {mean_squared_error(y_test, model_xgb.predict(X_test)) ** 0.5}')

MAPE on train set: 29.31453046324141
MAPE on valid set: 39.05310546117016
MAPE on test set: 35.1906017928138
RMSE on train set: 43476.11129873829
RMSE on valid set: 44799.71446329251
RMSE on test set: 51871.80100954202


## Task 4 (1 point) CatBoost

Now the CatBoost library.

Train a `CatBoostRegressor` model, tuning the hyperparameters (`depth, learning_rate, iterations`, etc.) on the validation set. Evaluate the quality of the final model (MAPE, RMSE), as well as its training and prediction speed.

In [99]:
from catboost import CatBoostRegressor
from sklearn.model_selection import ParameterGrid

params = {
    'depth' : [1, 3, 5, 10],
    'learning_rate' : [0.01, 0.05, 0.1, 0.5, 1],
    'iterations' : [100, 200, 500]
}

grid = list(ParameterGrid(params))
min_rmse = 1e30
min_params = [0] * 3
for par in grid:
    model_catb = CatBoostRegressor(depth=par['depth'],
                            learning_rate=par['learning_rate'],
                            iterations=par['iterations'],
                            silent=True)
    model_catb.fit(X_train, y_train)
    rmse = mean_squared_error(y_valid, model_catb.predict(X_valid)) ** 0.5
    if min_rmse > rmse:
        min_rmse = rmse
        min_params = [par['depth'],
                    par['learning_rate'],
                    par['iterations']]
min_rmse, min_params

(44920.72679988568, [5, 0.1, 500])

In [100]:
%%time
model_catb = CatBoostRegressor(depth=min_params[0],
                        learning_rate=min_params[1],
                        iterations=min_params[2],
                        silent=True)
model_catb.fit(X_train, y_train)
print('')


CPU times: total: 5.84 s
Wall time: 1.08 s


In [101]:
%%timeit
model_catb.predict(X_test)

3.36 ms ± 195 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [102]:
print(f'MAPE on train set: {mean_absolute_percentage_error(y_train, model_catb.predict(X_train)) * 100}')
print(f'MAPE on valid set: {mean_absolute_percentage_error(y_valid, model_catb.predict(X_valid)) * 100}')
print(f'MAPE on test set: {mean_absolute_percentage_error(y_test, model_catb.predict(X_test)) * 100}')

print(f'RMSE on train set: {mean_squared_error(y_train, model_catb.predict(X_train)) ** 0.5}')
print(f'RMSE on valid set: {mean_squared_error(y_valid, model_catb.predict(X_valid)) ** 0.5}')
print(f'RMSE on test set: {mean_squared_error(y_test, model_catb.predict(X_test)) ** 0.5}')

MAPE on train set: 29.22923945542191
MAPE on valid set: 38.1173175961001
MAPE on test set: 34.11740980931401
RMSE on train set: 42347.2216991132
RMSE on valid set: 44920.72679988568
RMSE on test set: 50230.527909013326


CatBoost models do not require categorical features to be encoded beforehand; the model can encode them itself. Train CatBoost again with hyperparameter tuning, this time passing the data to the model through a Pool and marking which features are categorical with the cat_features parameter. Evaluate the quality and the time. Did it get better?

In [103]:
from catboost import Pool
from sklearn.model_selection import ParameterGrid
# Repeat the split of the original data into train, validation, and test sets
df_p = pd.read_csv("homework_8_boosting_clustering_tursunov/ds_salaries.csv")
X_p = df_p.drop('salary', axis=1) # Leaky feature: the same salary in local currency, strongly correlated with the target
y_p = df_p['salary_in_usd']
X_p = X_p.drop('salary_in_usd', axis=1)
X_train_p, X_test_p, y_train_p, y_test_p = train_test_split(X_p, y_p, test_size=0.2, random_state=42)
X_valid_p, X_test_p, y_valid_p, y_test_p = train_test_split(X_test_p, y_test_p, test_size=0.5, random_state=42)

train_pool = Pool(data=X_train_p, label=y_train_p, cat_features=categorical_features)
valid_pool = Pool(data=X_valid_p, label=y_valid_p, cat_features=categorical_features)
test_pool = Pool(data=X_test_p, label=y_test_p, cat_features=categorical_features)
params = {
    'depth' : [1, 3, 5, 10],
    'learning_rate' : [0.01, 0.05, 0.1, 0.5, 1],
    'iterations' : [100, 200, 500]
}

grid = list(ParameterGrid(params))
min_rmse = 1e30
min_params = [0] * 3
for par in grid:
    model_catb = CatBoostRegressor(depth=par['depth'],
                            learning_rate=par['learning_rate'],
                            iterations=par['iterations'],
                            silent=True)
    model_catb.fit(train_pool)
    rmse = mean_squared_error(y_valid_p, model_catb.predict(valid_pool)) ** 0.5
    if min_rmse > rmse:
        min_rmse = rmse
        min_params = [par['depth'],
                    par['learning_rate'],
                    par['iterations']]
min_rmse, min_params

(43085.73325206323, [10, 0.05, 500])

In [104]:
%%time
model_catb = CatBoostRegressor(depth=min_params[0],
                        learning_rate=min_params[1],
                        iterations=min_params[2],
                        silent=True)
model_catb.fit(train_pool)

CPU times: total: 1min 30s
Wall time: 31.5 s


<catboost.core.CatBoostRegressor at 0x16d007fc690>

In [105]:
%%timeit
model_catb.predict(test_pool)

3.43 ms ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [106]:
print(f'MAPE on train set: {mean_absolute_percentage_error(y_train_p, model_catb.predict(train_pool)) * 100}')
print(f'MAPE on valid set: {mean_absolute_percentage_error(y_valid_p, model_catb.predict(valid_pool)) * 100}')
print(f'MAPE on test set: {mean_absolute_percentage_error(y_test_p, model_catb.predict(test_pool)) * 100}')

print(f'RMSE on train set: {mean_squared_error(y_train_p, model_catb.predict(train_pool)) ** 0.5}')
print(f'RMSE on valid set: {mean_squared_error(y_valid_p, model_catb.predict(valid_pool)) ** 0.5}')
print(f'RMSE on test set: {mean_squared_error(y_test_p, model_catb.predict(test_pool)) ** 0.5}')

MAPE on train set: 30.65466463334311
MAPE on valid set: 43.14808268511115
MAPE on test set: 33.782829684513246
RMSE on train set: 44221.49498209315
RMSE on valid set: 46470.834380745706
RMSE on test set: 49547.73709691963


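**Answer:** CatBoost with Pool is much slower to train (31.5 s wall time versus 1.08 s), but slightly more accurate on the test set (RMSE 49.5k versus 50.2k, MAPE 33.8% versus 34.1%); on the validation set, however, its metrics are worse.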

## Task 5 (0.5 points) LightGBM

Finally, the LightGBM library: use `LGBMRegressor`, tune the hyperparameters again, and evaluate quality and speed.


In [117]:
from lightgbm import LGBMRegressor
from sklearn.model_selection import ParameterGrid

params = {
    'max_depth' : [1, 3, 5, 10, 15, 20],
    'learning_rate' : [0.01, 0.05, 0.1, 0.5, 1],
    'n_estimators' : [50, 100, 150, 200]
}

grid = list(ParameterGrid(params))
min_rmse = 1e30
min_params = [0] * 3
for par in grid:
    model_lgbm = LGBMRegressor(max_depth=par['max_depth'],
                            learning_rate=par['learning_rate'],
                            n_estimators=par['n_estimators'])
    model_lgbm.fit(X_train, y_train)
    rmse = mean_squared_error(y_valid, model_lgbm.predict(X_valid)) ** 0.5
    if min_rmse > rmse:
        min_rmse = rmse
        min_params = [par['max_depth'],
                    par['learning_rate'],
                    par['n_estimators']]
min_rmse, min_params

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000323 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 86
[LightGBM] [Info] Number of data points in the train set: 3004, number of used features: 41
[LightGBM] [Info] Start training from score 138055.989348

(45352.26823559309, [1, 0.5, 50])

In [118]:
%%time
model_lgbm = LGBMRegressor(max_depth=min_params[0],
                        learning_rate=min_params[1],
                        n_estimators=min_params[2])
model_lgbm.fit(X_train, y_train)
print('')

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000350 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 86
[LightGBM] [Info] Number of data points in the train set: 3004, number of used features: 41
[LightGBM] [Info] Start training from score 138055.989348

CPU times: total: 62.5 ms
Wall time: 21 ms


In [120]:
%%timeit
model_lgbm.predict(X_test)

1.01 ms ± 397 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [121]:
print(f'MAPE on train set: {mean_absolute_percentage_error(y_train, model_lgbm.predict(X_train)) * 100}')
print(f'MAPE on valid set: {mean_absolute_percentage_error(y_valid, model_lgbm.predict(X_valid)) * 100}')
print(f'MAPE on test set: {mean_absolute_percentage_error(y_test, model_lgbm.predict(X_test)) * 100}')

print(f'RMSE on train set: {mean_squared_error(y_train, model_lgbm.predict(X_train)) ** 0.5}')
print(f'RMSE on valid set: {mean_squared_error(y_valid, model_lgbm.predict(X_valid)) ** 0.5}')
print(f'RMSE on test set: {mean_squared_error(y_test, model_lgbm.predict(X_test)) ** 0.5}')

MAPE on train set: 36.453471003584085
MAPE on valid set: 40.86635727920316
MAPE on test set: 35.917290385702685
RMSE on train set: 47893.793590264744
RMSE on valid set: 45352.26823559309
RMSE on test set: 50491.9236618416


## Задание 6 (2 балла) Сравнение и выводы

Сравните модели бустинга и сделайте про них выводы, какая из моделей показала лучший/худший результат по качеству, скорости обучения и скорости предсказания? Как отличаются гиперпараметры для разных моделей?

**Answer:**

| Model | Training time | Prediction time | Test MAPE, % | Test RMSE, thousands |
|---|---|---|---|---|
| LightGBM | 62 ms | 1 ms | 36 | 50.5 |
| CatBoost + Pool | 90 s | 3.4 ms | 33 | 49.5 |
| CatBoost | 5 s | 3.6 ms | 34 | 50.2 |
| XGBoost | 469 ms | 336 µs | 35 | 51.8 |

Overall, the results are fairly similar. LightGBM trains fastest and XGBoost predicts fastest. LightGBM has the highest test MAPE, while XGBoost has the highest test RMSE. CatBoost + Pool was the most accurate on both metrics, but it also took by far the longest to train.

The hyperparameters of the models are quite similar: each has a limit on tree depth, a `learning_rate`, and a number of boosting iterations.
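
To make such a comparison reproducible, the timing and metric calls can be wrapped in one helper. This is a minimal sketch, not the code that produced the numbers above: `benchmark` is a hypothetical helper, synthetic data stands in for the salary dataset, and scikit-learn's `GradientBoostingRegressor` stands in for the LightGBM, CatBoost and XGBoost models (any estimator with `fit`/`predict` should work the same way).

```python
import time

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error


def benchmark(model, X_train, y_train, X_test, y_test):
    """Fit `model` and return timings (seconds) and test metrics."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    pred = model.predict(X_test)
    predict_seconds = time.perf_counter() - start

    return {
        "fit_s": fit_seconds,
        "predict_s": predict_seconds,
        "mape_pct": mean_absolute_percentage_error(y_test, pred) * 100,
        "rmse": mean_squared_error(y_test, pred) ** 0.5,
    }


# Synthetic stand-in data (assumption): positive targets so MAPE is well defined.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = 100_000 + 20_000 * X[:, 0] + rng.normal(scale=5_000, size=500)

# Stand-in model (assumption); swap in the tuned boosting models to compare them.
result = benchmark(
    GradientBoostingRegressor(n_estimators=50, max_depth=3, learning_rate=0.1, random_state=42),
    X[:400], y[:400], X[400:], y[400:],
)
print(result)
```

A single wall-clock measurement is noisy, so for small models it is probably worth repeating the call several times, as `%%timeit` does.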

# Part 2 Clustering (5 points)

We will work with data on which artists the users of a music service listen to.

Each row of the table holds information about one user. Each column is an artist (The Beatles, Radiohead, etc.)

For each (user, artist) pair the table contains a number: the share of this user's listening that goes to this artist.


In [None]:
import pandas as pd
ratings = pd.read_excel("https://github.com/evgpat/edu_stepik_rec_sys/blob/main/datasets/sample_matrix.xlsx?raw=true", engine='openpyxl')
ratings.head()

Unnamed: 0,user,the beatles,radiohead,deathcab for cutie,coldplay,modest mouse,sufjan stevens,dylan. bob,red hot clili peppers,pink fluid,...,municipal waste,townes van zandt,curtis mayfield,jewel,lamb,michal w. smith,群星,agalloch,meshuggah,yellowcard
0,0,,0.020417,,,,,,0.030496,,...,,,,,,,,,,
1,1,,0.184962,0.024561,,,0.136341,,,,...,,,,,,,,,,
2,2,,,0.028635,,,,0.024559,,,...,,,,,,,,,,
3,3,,,,,,,,,,...,,,,,,,,,,
4,4,0.043529,0.086281,0.03459,0.016712,0.015935,,,,,...,,,,,,,,,,


We will cluster the artists: if many people listened to two artists for roughly the same share of their time (that is, the artists' vectors are close in space), then the artists may be similar. This information can be useful when building recommender systems.

## Task 1 (0.5 points) Preparation

Transpose the `ratings` matrix so that the rows correspond to artists.

In [None]:
# -- YOUR CODE HERE --

Drop the row named `user`.

In [None]:
# -- YOUR CODE HERE --

The table has many missing values, because users listen not to every artist available in the service but only to some subset (usually about 30 artists).

The share of an artist in the music a user has listened to is 0 if the user has never listened to that artist, so fill the missing values with zeros.
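
The three preparation steps can be sketched on a toy table. This is an illustration under assumptions rather than the reference solution: `toy` and `artists` are hypothetical names, and the tiny frame only mimics the layout of `ratings` (one `user` column plus one column per artist).

```python
import numpy as np
import pandas as pd

# Toy stand-in for `ratings` (assumption): rows are users, columns are artists.
toy = pd.DataFrame({
    "user": [0, 1, 2],
    "the beatles": [np.nan, 0.04, 0.10],
    "radiohead": [0.02, np.nan, np.nan],
})

artists = toy.T                 # rows become artists (plus the `user` row)
artists = artists.drop("user")  # drop the row holding user ids
artists = artists.fillna(0)     # never listened -> share of 0

print(artists)
```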



In [None]:
# -- YOUR CODE HERE --
ratings.sample()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4990,4991,4992,4993,4994,4995,4996,4997,4998,4999
ben harper,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Task 2 (0.5 points) First clustering

Apply KMeans with 5 clusters and save the resulting labels.

In [None]:
from sklearn.cluster import KMeans

# -- YOUR CODE HERE --

Print the cluster sizes. Is the clustering useful? Why might KMeans produce such a result?
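
One way to count cluster sizes is `np.bincount` on the label array. The sketch below runs on synthetic data (an assumption, not the music dataset): most rows are sparse with small values and one row has much larger values, a setup in which KMeans tends to isolate the large row.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in (assumption): 199 sparse, small rows plus one row with much larger values.
rng = np.random.default_rng(0)
X = rng.random((200, 30)) * (rng.random((200, 30)) < 0.1) * 0.05
X[0] = rng.random(30)  # one "very popular" row far from all others

km = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = km.fit_predict(X)

sizes = np.bincount(labels, minlength=5)  # number of objects per cluster
print(sizes)
```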

In [None]:
# -- YOUR CODE HERE --

**Answer:** # -- YOUR ANSWER HERE --

## Task 3 (0.5 points) Explaining the results

The clustering produced $\geq 1$ clusters of size 1. Print the artists that make up these clusters. The Beatles should be among them.

In [None]:
# -- YOUR CODE HERE --

Examine the data: why do The Beatles in particular stand out?

Hint: look at the share of users who listen to each artist and at the mean listening share.
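
The two statistics from the hint can be computed per row. This is a sketch on a toy frame (an assumption; `toy` and the artist names are hypothetical) with artists as rows, users as columns, and zeros for "never listened".

```python
import pandas as pd

# Toy stand-in (assumption): rows are artists, columns are users, zeros mean "never listened".
toy = pd.DataFrame(
    [[0.10, 0.20, 0.00, 0.30],
     [0.00, 0.05, 0.00, 0.00]],
    index=["artist a", "artist b"],
)

listener_share = (toy > 0).mean(axis=1)           # fraction of users who listen to the artist
mean_share = toy.mean(axis=1)                     # mean listening share over all users
mean_share_listeners = toy[toy > 0].mean(axis=1)  # mean share among listeners only (NaN skipped)

stats = pd.DataFrame({
    "listener_share": listener_share,
    "mean_share": mean_share,
    "mean_share_listeners": mean_share_listeners,
}).sort_values("listener_share", ascending=False)
print(stats)
```

Sorting by either column should show which artists have unusually large vectors compared with the rest.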

In [None]:
# -- YOUR CODE HERE --

**Answer:** # -- YOUR ANSWER HERE --

## Task 4 (0.5 points) Improving the clustering

Let's try to get rid of this problem: normalize the data with `normalize`.
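
As a small illustration of what `normalize` does with its default settings (`norm="l2"`, `axis=1`): every row is rescaled to unit Euclidean length, so two rows that point in the same direction become identical regardless of their original magnitude. The array below is made up for the example.

```python
import numpy as np
from sklearn.preprocessing import normalize

X = np.array([
    [3.0, 4.0, 0.0],    # norm 5
    [0.03, 0.04, 0.0],  # same direction, 100x smaller
    [0.0, 0.0, 2.0],
])

X_norm = normalize(X)  # default: norm="l2", axis=1 (each row scaled to unit length)
row_norms = np.linalg.norm(X_norm, axis=1)
print(X_norm)
print(row_norms)
```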

In [None]:
from sklearn.preprocessing import normalize

# -- YOUR CODE HERE --

Apply KMeans with 5 clusters to the transformed matrix and look at the cluster sizes. Did it get better? Can the clustering be useful now?

In [None]:
# -- YOUR CODE HERE --

**Answer:** # -- YOUR ANSWER HERE --

## Task 5 (1 point) Centroids

For each cluster, print the names of the top 10 artists closest to the centroid by cosine distance. Interpret the result. What can be said about the meaning of the clusters?
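
One possible way to rank objects by cosine distance to each centroid is `scipy.spatial.distance.cdist` with `metric="cosine"`, which computes all centroid-to-object distances at once. The sketch below uses synthetic data and hypothetical artist names (both assumptions), and ranks all objects rather than only the members of each cluster.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Synthetic stand-in (assumption): 50 "artists" described by 20 "users", hypothetical names.
rng = np.random.default_rng(0)
X = normalize(rng.random((50, 20)))
names = np.array([f"artist_{i}" for i in range(50)])

km = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)

# dist[i, j] = cosine distance between centroid i and object j
dist = cdist(km.cluster_centers_, X, metric="cosine")

top_k = 10
top_names = {}
for cluster_id, row in enumerate(dist):
    nearest = np.argsort(row)[:top_k]  # indices of the closest objects
    top_names[cluster_id] = names[nearest].tolist()
    print(cluster_id, top_names[cluster_id])
```

To restrict the ranking to cluster members, the distances could be masked with `km.labels_ == cluster_id` before sorting.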

In [None]:
from scipy.spatial.distance import cosine


centroids = km.cluster_centers_

# -- YOUR CODE HERE --

**Answer:** # -- YOUR ANSWER HERE --

## Task 6 (1 point) Visualization

It would be nice to visualize the resulting clustering somehow. Draw scatter plots with `plt.scatter` for several pairs of artist features, coloring the points by cluster. Why did the visualizations turn out this way? Do they reflect the separation into clusters well? Why?

In [None]:
import matplotlib.pyplot as plt

# -- YOUR CODE HERE --

**Answer:** # -- YOUR ANSWER HERE --

There is a method for visualizing high-dimensional data called t-SNE (t-distributed stochastic neighbor embedding). It is a nonlinear dimensionality reduction method: each high-dimensional object is modeled by an object of lower dimension (for example, 2) so that, with high probability, similar objects are modeled as nearby points and dissimilar objects as distant ones.

Apply `TSNE` from the `sklearn` library and visualize the resulting objects, coloring them by cluster.
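
A minimal sketch of the embedding step, assuming synthetic data in place of the normalized artist matrix; the `perplexity` value is an arbitrary choice for this toy example and has to be smaller than the number of objects.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Synthetic stand-in (assumption): 100 objects with 20 features.
rng = np.random.default_rng(0)
X = rng.random((100, 20))
labels = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)

# Embed into 2 dimensions; perplexity must be smaller than the number of objects.
embedding = TSNE(n_components=2, perplexity=20, random_state=42).fit_transform(X)
print(embedding.shape)
```

The result can then be drawn with `plt.scatter(embedding[:, 0], embedding[:, 1], c=labels)`.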

In [None]:
from sklearn.manifold import TSNE

# -- YOUR CODE HERE --

## Task 7 (1 point) Hyperparameter tuning

Choose the optimal number of clusters (at most 100 clusters) using the silhouette score. Fix `random_state=42`.
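
The search can be sketched as a loop over candidate values of `k` that keeps the one with the highest silhouette score. The example runs on three synthetic, well-separated blobs (an assumption) and a short range of `k`; on the real data the range would go up to 100.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in (assumption): three tight, well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centers])

scores = {}
for k in range(2, 8):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # k with the highest silhouette score
print(best_k, round(scores[best_k], 3))
```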

In [None]:
from sklearn.metrics import silhouette_score

# -- YOUR CODE HERE --

Print the artists closest to the centroids (as in Task 5). How do the results compare? Did the meaning of the clusters stay the same? If it changed and there are too many clusters to describe them all, describe the meaning of 1-2 interesting clusters.

In [None]:
# -- YOUR CODE HERE --

**Answer:** # -- YOUR ANSWER HERE --

Make a t-SNE visualization of the resulting clustering.

In [None]:
# -- YOUR CODE HERE --

If there are too many clusters and the colors are hard to tell apart visually, color only one interesting cluster from the task above (`c = (labels == i)`). Is this cluster well reflected in the visualization?

In [None]:
# -- YOUR CODE HERE --

**Answer:** # -- YOUR ANSWER HERE --