## Task:

Build a machine learning model for a regression task: predict the progression of diabetes one year after baseline.

### Solution plan:

1. In the scikit-learn documentation, find how to load the dataset used to predict disease progression one year after baseline (`load_diabetes` from `sklearn.datasets`).

In [505]:
import pandas as pd

from matplotlib import pyplot as plt
from sklearn import tree
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, root_mean_squared_error, r2_score, mean_absolute_error
from sklearn.model_selection import train_test_split

In [506]:
df = load_diabetes(as_frame=True).frame
df.head()

|   | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 | 135.0 |


2. Print the `DESCR` field returned by the dataset loading function to examine what the dataset contains.

In [507]:
print(load_diabetes().DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

:Number of Instances: 442

:Number of Attributes: First 10 columns are numeric predictive values

:Target: Column 11 is a quantitative measure of disease progression one year after baseline

:Attribute Information:
    - age     age in years
    - sex
    - bmi     body mass index
    - bp      average blood pressure
    - s1      tc, total serum cholesterol
    - s2      ldl, low-density lipoproteins
    - s3      hdl, high-density lipoproteins
    - s4      tch, total cholesterol / HDL
    - s5      ltg, possibly log of serum triglycerides level
    - s6      glu, blood sugar level

Note: Each of these 10 feature variables have bee

In [508]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     442 non-null    float64
 1   sex     442 non-null    float64
 2   bmi     442 non-null    float64
 3   bp      442 non-null    float64
 4   s1      442 non-null    float64
 5   s2      442 non-null    float64
 6   s3      442 non-null    float64
 7   s4      442 non-null    float64
 8   s5      442 non-null    float64
 9   s6      442 non-null    float64
 10  target  442 non-null    float64
dtypes: float64(11)
memory usage: 38.1 KB


In [509]:
df.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 442.0 | -2.511817e-19 | 0.047619 | -0.107226 | -0.037299 | 0.005383 | 0.038076 | 0.110727 |
| sex | 442.0 | 1.23079e-17 | 0.047619 | -0.044642 | -0.044642 | -0.044642 | 0.05068 | 0.05068 |
| bmi | 442.0 | -2.245564e-16 | 0.047619 | -0.090275 | -0.034229 | -0.007284 | 0.031248 | 0.170555 |
| bp | 442.0 | -4.79757e-17 | 0.047619 | -0.112399 | -0.036656 | -0.00567 | 0.035644 | 0.132044 |
| s1 | 442.0 | -1.381499e-17 | 0.047619 | -0.126781 | -0.034248 | -0.004321 | 0.028358 | 0.153914 |
| s2 | 442.0 | 3.918434e-17 | 0.047619 | -0.115613 | -0.030358 | -0.003819 | 0.029844 | 0.198788 |
| s3 | 442.0 | -5.777179e-18 | 0.047619 | -0.102307 | -0.035117 | -0.006584 | 0.029312 | 0.181179 |
| s4 | 442.0 | -9.04254e-18 | 0.047619 | -0.076395 | -0.039493 | -0.002592 | 0.034309 | 0.185234 |
| s5 | 442.0 | 9.293722e-17 | 0.047619 | -0.126097 | -0.033246 | -0.001947 | 0.032432 | 0.133597 |
| s6 | 442.0 | 1.130318e-17 | 0.047619 | -0.137767 | -0.033179 | -0.001078 | 0.027917 | 0.135612 |
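
Every feature in the summary above has the same standard deviation, 0.047619. A quick check, assuming each column was centered and then scaled so that its squared values sum to 1 (an assumption here, not something the outputs above state in full), reproduces that number from the row count alone:

```python
import math

n_samples = 442  # number of rows reported by df.info()

# Assumption: each centered column has unit sum of squares.
sum_of_squares = 1.0

# pandas' describe() reports the sample standard deviation (ddof=1).
expected_std = math.sqrt(sum_of_squares / (n_samples - 1))

print(round(expected_std, 6))  # 0.047619
```

Under that assumption the features are already on a common scale, so no extra scaling step is needed before fitting the models.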


3. Prepare the data for training: separate the target from the dataset, then split the dataset into training and validation parts.

In [510]:
df['target'].value_counts().count()

np.int64(214)

In [511]:
features = df.drop(['target'], axis=1)
target = df['target']

In [512]:
features_train, features_val, target_train, target_val = train_test_split(features, target, test_size=0.2, random_state=42)
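
The sizes of the two parts follow from the row count and `test_size=0.2`. A small sketch of the arithmetic, assuming the validation share is rounded up and the remainder goes to training (the variable names are illustrative):

```python
import math

n_samples = 442   # rows in the diabetes dataset
test_size = 0.2   # same value as passed to train_test_split

# Assumption: the validation share is rounded up, the rest is used for training.
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)  # 353 89
```

The 89 validation rows match the count of 89 printed by the evaluation cells further down.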

4. Train a decision tree and assess whether the trained model is adequate. It is enough to compare its quality metrics with those of a simple model, for example one that always predicts the mean of the target variable computed on the training part of the sample.

In [513]:
# No random_state is set, so the fitted tree and its metrics can differ between runs.
model_tree = tree.DecisionTreeRegressor()
model_tree.fit(features_train, target_train)
predictions_tree_val = model_tree.predict(features_val)

In [514]:
def count_errors(true_answers, pred_answers):
    # Count predictions that are not exactly equal to the true value.
    # With a continuous target nearly every prediction differs, so this count
    # mostly reflects the validation-set size, not regression quality.
    return sum(1 for true, pred in zip(true_answers, pred_answers) if true != pred)

print("Errors:", count_errors(target_val, predictions_tree_val))

mse_tree = mean_squared_error(target_val, predictions_tree_val)
print('MSE =', mse_tree)

rmse_tree = root_mean_squared_error(target_val, predictions_tree_val)
print('RMSE =', rmse_tree)

r2_tree = r2_score(target_val, predictions_tree_val)
print('R2 =', r2_tree)

mae_tree = mean_absolute_error(target_val, predictions_tree_val)
print('MAE =', mae_tree)

Errors: 89
MSE = 4562.820224719101
RMSE = 67.54865079865846
R2 = 0.13879019678954674
MAE = 52.8876404494382
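
To make the four metrics concrete, here is a minimal sketch that computes them by hand on toy numbers (the values are illustrative and not taken from the dataset). It follows the standard definitions, which the scikit-learn functions used above are expected to match:

```python
import math

y_true = [3.0, 5.0, 7.0, 9.0]    # toy targets, not from the dataset
y_pred = [2.0, 5.0, 8.0, 11.0]   # toy predictions

n = len(y_true)
residuals = [t - p for t, p in zip(y_true, y_pred)]

mse = sum(r ** 2 for r in residuals) / n          # mean squared error
rmse = math.sqrt(mse)                             # same units as the target
mae = sum(abs(r) for r in residuals) / n          # mean absolute error

y_mean = sum(y_true) / n
ss_res = sum(r ** 2 for r in residuals)           # residual sum of squares
ss_tot = sum((t - y_mean) ** 2 for t in y_true)   # spread around the mean
r2 = 1 - ss_res / ss_tot                          # 0 means "no better than the mean"

print(mse, round(rmse, 4), mae, round(r2, 4))
```

Because R2 compares the model against the mean of the evaluated targets, a value near 0 or below signals a model that is no better than a constant prediction.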


To assess the model's adequacy, I create a simple baseline: the same kind of model trained on a constant target equal to the mean of the target variable.

In [515]:
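# Constant baseline target. Note: df['target'].mean() averages the whole dataset,
# validation rows included; target_train.mean() would use only the training part.
target_avg_train = pd.Series([df['target'].mean()] * len(target_train))
target_avg_val = pd.Series([df['target'].mean()] * len(target_val))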

In [516]:
model_tree_avg = tree.DecisionTreeRegressor()
model_tree_avg.fit(features_train, target_avg_train)
predictions_tree_avg = model_tree_avg.predict(features_val) 


In [517]:
print("Errors:", count_errors(target_val, predictions_tree_avg))

mse_tree_avg = mean_squared_error(target_val, predictions_tree_avg)
print(f'MSE = {mse_tree_avg:.25f}')

rmse_tree_avg = root_mean_squared_error(target_val, predictions_tree_avg)
print('RMSE =', rmse_tree_avg)

r2_tree_avg = r2_score(target_val, predictions_tree_avg)
print('R2 =', r2_tree_avg)

mae_tree_avg = mean_absolute_error(target_val, predictions_tree_avg)
print('MAE =', mae_tree_avg)

Errors: 89
MSE = 5338.5784972632072822307236493
RMSE = 73.06557669151191
R2 = -0.007630349349265986
MAE = 63.79177894148169


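Comparing the two decision tree models shows that the trained model is adequate: it beats the constant baseline on every metric (MSE 4562.8 vs. 5338.6, RMSE 67.5 vs. 73.1, R2 0.139 vs. -0.008, MAE 52.9 vs. 63.8). Still, an R2 of about 0.14 means the tree explains only a small share of the variance.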
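
The same kind of baseline could also be sketched with scikit-learn's `DummyRegressor`, which ignores the features and predicts a constant. This is an alternative to the approach used in this notebook, not a reproduction of it: the sketch takes the mean of the training part only, so its metrics may differ slightly from the baseline output above.

```python
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

df = load_diabetes(as_frame=True).frame
features = df.drop(columns=['target'])
target = df['target']
features_train, features_val, target_train, target_val = train_test_split(
    features, target, test_size=0.2, random_state=42
)

# DummyRegressor ignores the features and always predicts the training mean.
baseline = DummyRegressor(strategy='mean')
baseline.fit(features_train, target_train)
predictions_baseline = baseline.predict(features_val)

r2_baseline = r2_score(target_val, predictions_baseline)
print('Baseline prediction:', predictions_baseline[0])
print('Baseline R2 =', r2_baseline)
```

A constant prediction cannot score above 0 in R2 on the set it is evaluated on, which gives a natural floor for judging the trained models.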

5. Train a linear regression model and assess its adequacy.

In [518]:
lin_reg_model = LinearRegression()
lin_reg_model.fit(features_train, target_train)
predictions_lin_reg_val = lin_reg_model.predict(features_val)

In [519]:
print("Errors:", count_errors(target_val, predictions_lin_reg_val))

mse_lin_reg = mean_squared_error(target_val, predictions_lin_reg_val)
print('MSE =', mse_lin_reg)

rmse_lin_reg = root_mean_squared_error(target_val, predictions_lin_reg_val)
print('RMSE =', rmse_lin_reg)

r2_lin_reg = r2_score(target_val, predictions_lin_reg_val)
print('R2 =', r2_lin_reg)

mae_lin_reg = mean_absolute_error(target_val, predictions_lin_reg_val)
print('MAE =', mae_lin_reg)

Errors: 89
MSE = 2900.1936284934827
RMSE = 53.85344583676594
R2 = 0.45260276297191926
MAE = 42.794094679599944


Now compare the resulting model with the simple baseline model to assess its adequacy.

In [520]:
lin_reg_model_avg = LinearRegression()
lin_reg_model_avg.fit(features_train, target_avg_train)
predictions_lin_reg_avg = lin_reg_model_avg.predict(features_val)

In [521]:
print("Errors:", count_errors(target_val, predictions_lin_reg_avg))

mse_lin_reg_avg = mean_squared_error(target_val, predictions_lin_reg_avg)
print('MSE =', mse_lin_reg_avg)

rmse_lin_reg_avg = root_mean_squared_error(target_val, predictions_lin_reg_avg)
print('RMSE =', rmse_lin_reg_avg)

r2_lin_reg_avg = r2_score(target_val, predictions_lin_reg_avg)
print('R2 =', r2_lin_reg_avg)

mae_lin_reg_avg = mean_absolute_error(target_val, predictions_lin_reg_avg)
print('MAE =', mae_lin_reg_avg)

Errors: 89
MSE = 5338.57849726319
RMSE = 73.06557669151178
R2 = -0.007630349349262655
MAE = 63.79177894148152


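Comparing the two linear regression models shows that the trained model is adequate: it beats the constant baseline on every metric (MSE 2900.2 vs. 5338.6, RMSE 53.9 vs. 73.1, R2 0.453 vs. -0.008, MAE 42.8 vs. 63.8).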
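
The tree baseline and the linear baseline report practically the same metrics (MSE of about 5338.578 in both cases). A likely explanation is that a model fitted to a constant target simply predicts that constant, whatever the model family. A minimal sketch on synthetic data (the features and the value 152.0 are illustrative, not taken from the dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))    # synthetic features, not the diabetes data
X_new = rng.normal(size=(5, 3))
y_const = np.full(50, 152.0)          # illustrative constant target

predictions = {}
for model in (LinearRegression(), DecisionTreeRegressor()):
    # With no variation in the target, the features carry no signal,
    # so every prediction collapses to the constant.
    model.fit(X_train, y_const)
    predictions[type(model).__name__] = model.predict(X_new)

for name, values in predictions.items():
    print(name, values)
```

In that case one constant baseline is enough for both comparisons; fitting a separate baseline per model family adds no information.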

6. Choose the best model and justify your choice.

In [525]:
tree_rate = 0
lin_reg_rate = 0

if mse_tree < mse_lin_reg:
    tree_rate += 1
else:
    lin_reg_rate += 1

if rmse_tree < rmse_lin_reg:
    tree_rate += 1
else:
    lin_reg_rate += 1

if r2_tree > r2_lin_reg:
    tree_rate += 1
else:
    lin_reg_rate += 1

if mae_tree < mae_lin_reg:
    tree_rate += 1
else:
    lin_reg_rate += 1

print(f'tree_rate = {tree_rate}\nlin_reg_rate = {lin_reg_rate}')
if tree_rate > lin_reg_rate:
    print("Best model: decision tree")
elif lin_reg_rate > tree_rate:
    print("Best model: linear regression")
else:
    print("The models are tied")

tree_rate = 0
lin_reg_rate = 4
Best model: linear regression
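
The metric-by-metric vote can be written more compactly with a dictionary. The sketch below hard-codes rounded values copied from the outputs above; the names `metrics`, `higher_is_better`, and `wins` are illustrative. Note that RMSE is the square root of MSE, so those two metrics always vote the same way and the four votes are not fully independent.

```python
# Validation metrics copied (rounded) from the outputs above.
metrics = {
    'tree':    {'MSE': 4562.82, 'RMSE': 67.55, 'R2': 0.1388, 'MAE': 52.89},
    'lin_reg': {'MSE': 2900.19, 'RMSE': 53.85, 'R2': 0.4526, 'MAE': 42.79},
}
higher_is_better = {'R2'}  # for the error metrics, lower is better

wins = {name: 0 for name in metrics}
for metric in ('MSE', 'RMSE', 'R2', 'MAE'):
    pick = max if metric in higher_is_better else min
    winner = pick(metrics, key=lambda name: metrics[name][metric])
    wins[winner] += 1

print(wins)  # {'tree': 0, 'lin_reg': 4}
```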


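## Conclusion:

Linear regression is the better model on every metric: MSE 2900.2 vs. 4562.8, RMSE 53.9 vs. 67.5, R2 0.453 vs. 0.139, and MAE 42.8 vs. 52.9 for the decision tree.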