Build and save a model for later deployment on a website.
The model predicts from a user's parameters whether they have heart disease.

Libraries used: pandas, numpy, sklearn

Data source: https://www.kaggle.com/ronitf/heart-disease-uci

In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import warnings 
warnings.filterwarnings('ignore')

In [2]:
data = pd.read_csv('C:/datasets/heart.csv')

In [3]:
data

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


There are many features here, but some of them are of no use to us yet: cp is the chest pain type (0, 1, 2, 3), and we do not know what these codes mean; the same goes for restecg, oldpeak, slope, ca and thal. To start, let's take simple features that can be measured at home: age, sex, trestbps (resting blood pressure) and thalach (maximum heart rate).

In [4]:
target = data['target']
data.drop('target', axis=1, inplace=True)

In [5]:
cols = ['age', 'sex', 'trestbps', 'thalach']

In [6]:
data = data[cols]

In [7]:
data.head()

Unnamed: 0,age,sex,trestbps,thalach
0,63,1,145,150
1,37,1,130,187
2,41,0,130,172
3,56,1,120,178
4,57,0,120,163


In [8]:
target.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

Logistic regression

In [9]:
model = LogisticRegression()

In [10]:
x_train, x_test, y_train, y_test = train_test_split(data.values, target.values,
                                                    test_size=.2, shuffle=True, random_state=42)

In [11]:
model.fit(x_train, y_train)

LogisticRegression()

In [12]:
model.score(x_test, y_test)

0.7540983606557377

In [13]:
data['age'].describe()

count    303.000000
mean      54.366337
std        9.082101
min       29.000000
25%       47.500000
50%       55.000000
75%       61.000000
max       77.000000
Name: age, dtype: float64

Decision trees

In [14]:
model_dtc = DecisionTreeClassifier()

In [15]:
model_dtc.fit(x_train, y_train)

DecisionTreeClassifier()

In [16]:
model_dtc.score(x_test, y_test)

0.639344262295082

k-nearest neighbors

In [17]:
model_knn = KNeighborsClassifier()

In [18]:
model_knn.fit(x_train, y_train)

KNeighborsClassifier()

In [19]:
model_knn.score(x_test, y_test)

0.6065573770491803

In [20]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

In [21]:
tree_parameters = {'max_depth': range(1, 10),
                  'max_features': range(1, 4)}

In [22]:
tree_grid = GridSearchCV(model_dtc, tree_parameters,
                        cv=5, n_jobs=-1, verbose=True)

In [23]:
tree_grid.fit(x_train, y_train)

Fitting 5 folds for each of 27 candidates, totalling 135 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:    3.7s
[Parallel(n_jobs=-1)]: Done 135 out of 135 | elapsed:    3.8s finished


GridSearchCV(cv=5, estimator=DecisionTreeClassifier(), n_jobs=-1,
             param_grid={'max_depth': range(1, 10),
                         'max_features': range(1, 4)},
             verbose=True)

In [24]:
tree_grid.best_params_

{'max_depth': 6, 'max_features': 2}

In [25]:
tree_grid.best_score_

0.7188775510204081

In [26]:
accuracy_score(y_test, tree_grid.predict(x_test))

0.7213114754098361

In [27]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

In [28]:
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=32)
cross_val_score(forest, x_train, y_train)

array([0.7755102 , 0.73469388, 0.70833333, 0.70833333, 0.64583333])

In [29]:
np.mean(cross_val_score(forest, x_train, y_train))

0.7145408163265307

In [30]:
forest_params = {'max_depth': range(1, 10),
                'max_features': range(2, 4)}

In [31]:
forest_grid = GridSearchCV(forest, forest_params, cv=5, n_jobs=-1)

In [32]:
forest_grid.fit(x_train, y_train)
forest_grid.best_params_, forest_grid.best_score_

({'max_depth': 5, 'max_features': 2}, 0.7352040816326532)

In [33]:
accuracy_score(y_test, forest_grid.predict(x_test))

0.7868852459016393

Let's add the following features to the ones we already use:
1 - cholesterol level (chol), 2 - blood sugar (fbs), 3 - exercise-induced angina (exang)

In [34]:
data2 = pd.read_csv('C:/datasets/heart.csv')

In [35]:
data2

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [36]:
cols = ['age', 'sex', 'trestbps', 'chol', 'fbs', 'thalach', 'exang']

In [37]:
target2 = data2['target']

In [38]:
data2 = data2[cols]

In [39]:
data2

Unnamed: 0,age,sex,trestbps,chol,fbs,thalach,exang
0,63,1,145,233,1,150,0
1,37,1,130,250,0,187,0
2,41,0,130,204,0,172,0
3,56,1,120,236,0,178,0
4,57,0,120,354,0,163,1
...,...,...,...,...,...,...,...
298,57,0,140,241,0,123,1
299,45,1,110,264,0,132,0
300,68,1,144,193,1,141,0
301,57,1,130,131,0,115,1


In [40]:
x_train, x_test, y_train, y_test = train_test_split(data2.values, target2.values,
                                                    test_size=.2, shuffle=True, random_state=53)

In [41]:
tree_params = {'max_depth': range(1, 10),
              'max_features': range(2, 7)}

tree_grid = GridSearchCV(model_dtc, tree_params, cv=5, n_jobs=-1, verbose=True)

In [42]:
tree_grid.fit(x_train, y_train)
tree_grid.best_params_, tree_grid.best_score_

Fitting 5 folds for each of 45 candidates, totalling 225 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 190 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done 225 out of 225 | elapsed:    0.3s finished


({'max_depth': 5, 'max_features': 2}, 0.7686224489795919)

In [43]:
np.mean(cross_val_score(tree_grid, x_train, y_train))

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


Fitting 5 folds for each of 45 candidates, totalling 225 fits


[Parallel(n_jobs=-1)]: Done 190 tasks      | elapsed:    0.3s
[Parallel(n_jobs=-1)]: Done 225 out of 225 | elapsed:    0.3s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


Fitting 5 folds for each of 45 candidates, totalling 225 fits


[Parallel(n_jobs=-1)]: Done 190 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done 225 out of 225 | elapsed:    0.3s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


Fitting 5 folds for each of 45 candidates, totalling 225 fits


[Parallel(n_jobs=-1)]: Done 175 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done 225 out of 225 | elapsed:    0.2s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


Fitting 5 folds for each of 45 candidates, totalling 225 fits


[Parallel(n_jobs=-1)]: Done 190 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done 225 out of 225 | elapsed:    0.3s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


Fitting 5 folds for each of 45 candidates, totalling 225 fits


[Parallel(n_jobs=-1)]: Done 190 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done 225 out of 225 | elapsed:    0.2s finished


0.710969387755102

In [44]:
tree_grid.score(x_test, y_test)

0.7049180327868853

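The decision tree trained on more features gave lower test accuracy (0.705 versus 0.721). Note that the two experiments use different train/test splits (random_state=42 and random_state=53), so the comparison is approximate.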

In [45]:
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=45)
cross_val_score(forest, x_train, y_train)

array([0.75510204, 0.71428571, 0.75      , 0.77083333, 0.79166667])

In [46]:
np.mean(cross_val_score(forest, x_train, y_train))

0.7563775510204082

The random forest, in turn, scored better in cross-validation on the training set when given more features (0.756 versus 0.715).

In [47]:
forest_params = {'max_depth': range(1, 10),
                'max_features': range(2, 7)}

In [48]:
forest_grid = GridSearchCV(forest, forest_params, cv=5, n_jobs=-1)


In [49]:
forest_grid.fit(x_train, y_train)
forest_grid.best_params_, forest_grid.best_score_

({'max_depth': 9, 'max_features': 2}, 0.7853741496598639)

In [50]:
np.mean(cross_val_score(forest_grid, x_train, y_train))

0.7398809523809524

In [51]:
accuracy_score(y_test, forest_grid.predict(x_test))

0.7377049180327869

On the test set, the forest trained on the dataset with fewer features did better (0.787 versus 0.738).
In addition, the best value of the 'max_features' hyperparameter is 2, which means that each split considers only 2 randomly chosen features; the forest as a whole can still use all of them.
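To illustrate what `max_features` controls, here is a minimal sketch on synthetic data. The dataset from `make_classification` and the names `X_demo`, `y_demo`, `demo_forest` and `n_used` are assumptions made for illustration, not the heart data used above. It fits a forest with `max_features=2` and counts how many features end up with non-zero importance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 4 informative features (illustrative assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=4, n_informative=4,
                                     n_redundant=0, random_state=0)

# max_features=2 limits the candidate features examined at EACH split,
# not the number of features the whole forest may use.
demo_forest = RandomForestClassifier(n_estimators=100, max_features=2, random_state=0)
demo_forest.fit(X_demo, y_demo)

# Count the features that contributed to at least one split.
n_used = int((demo_forest.feature_importances_ > 0).sum())
print(n_used)
```

If the count is larger than 2, the forest has drawn on more features than the per-split limit, which is the expected behavior.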

In [52]:
model = LogisticRegression().fit(x_train, y_train)
model.score(x_test, y_test)

0.8032786885245902

Logistic regression performed best, with about 80% accuracy on the test data.
Next we should check its quality with cross-validation.

In [53]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

In [54]:
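kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=34)
scores = []
for train_idx, test_idx in kf.split(x_train, y_train):
    # Split the training set into a fold-specific train and validation part
    xtr, xvl = x_train[train_idx], x_train[test_idx]
    ytr, yvl = y_train[train_idx], y_train[test_idx]

    lr = LogisticRegression()
    lr.fit(xtr, ytr)
    # Note: ROC AUC is computed from hard class labels here;
    # lr.predict_proba(xvl)[:, 1] would give the probability-based AUC.
    score = roc_auc_score(yvl, lr.predict(xvl))
    scores.append(score)
    print(score)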


0.6717171717171717
0.7752525252525252
0.7698412698412698
0.8059440559440559
0.7604895104895104


In [55]:
np.mean(scores)

0.7566489066489066

The test-set results of the random forest and logistic regression are close (0.787 and 0.803), while in cross-validation the regression comes out ahead (0.757 versus 0.740, although the first figure is ROC AUC and the second is accuracy). We also have not searched its hyperparameters yet, so let's do that.

In [56]:
# Grid search over the regularization strength C
logreg = LogisticRegression(class_weight='balanced')
param = {'C': [0.1, 0.3, 0.5, 1, 2, 3, 4, 5, 10]}
clf = GridSearchCV(logreg, param, scoring='roc_auc', refit=True, cv=10)
clf.fit(x_train, y_train)
print('Best roc_auc: {:.4}, with best C: {}'.format(clf.best_score_, clf.best_params_))

Best roc_auc: 0.8357, with best C: {'C': 1}


In [57]:
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=34)
scores = []
for train_idx, test_idx in kf.split(x_train, y_train):
    # Split the training set into a fold-specific train and validation part
    xtr, xvl = x_train[train_idx], x_train[test_idx]
    ytr, yvl = y_train[train_idx], y_train[test_idx]

    logreg = LogisticRegression(class_weight='balanced', C=1)
    logreg.fit(xtr, ytr)
    # Note: as before, ROC AUC is computed from hard class labels.
    score = roc_auc_score(yvl, logreg.predict(xvl))
    scores.append(score)
    print(score)


0.6759259259259259
0.7794612794612794
0.7936507936507936
0.8129370629370629
0.7674825174825175


In [58]:
np.mean(scores)

0.7658915158915158
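The grid search reported a ROC AUC of 0.8357, while the manual loop gives about 0.766. One likely contributor is that `scoring='roc_auc'` scores continuous model outputs, whereas the loop passes hard labels from `predict`; the different fold counts (10 and 5) may also play a role. The sketch below shows the two ways of calling `roc_auc_score` side by side on synthetic data. The dataset and the names `X_demo`, `clf_demo`, `auc_from_labels` and `auc_from_proba` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (illustrative assumption, not the heart dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0)

clf_demo = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard 0/1 predictions: the ROC curve has a single operating point.
auc_from_labels = roc_auc_score(y_val, clf_demo.predict(X_val))

# AUC from the predicted probability of class 1: uses the full ranking of samples.
auc_from_proba = roc_auc_score(y_val, clf_demo.predict_proba(X_val)[:, 1])

print(auc_from_labels, auc_from_proba)
```

The two numbers generally differ, so scores produced in these two ways should not be compared directly.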

Hooray! We have a winner, with a mean cross-validated ROC AUC of about 0.766 (76%). Let's save our model with pickle.

In [59]:
import pickle

# Note: logreg is the model fitted on the last cross-validation fold above.
with open('pickle_model.pkl', 'wb') as f:
    pickle.dump(logreg, f)
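The saved `logreg` object was last fitted inside the cross-validation loop, so it has seen only part of the training data. A possible alternative is to refit on the whole training set before saving, and to confirm that the file loads back correctly. The following is a minimal sketch on synthetic data; the data, the temporary directory and the names `final_model`, `restored` and `same_predictions` are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the 7-feature training set (illustrative assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)

# Fit once on all available training data, using the chosen hyperparameters.
final_model = LogisticRegression(class_weight='balanced', C=1, max_iter=1000)
final_model.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'pickle_model_demo.pkl')

    # The context manager closes the file even if pickling fails.
    with open(model_path, 'wb') as f:
        pickle.dump(final_model, f)

    with open(model_path, 'rb') as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
same_predictions = bool(np.array_equal(final_model.predict(X_demo),
                                       restored.predict(X_demo)))
print(same_predictions)
```

A pickled scikit-learn model should be loaded with the same library version that saved it, and only from a trusted source.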

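In our data, cholesterol and blood sugar are measured in mg/dL, but users will enter values in mmol/L, so the input has to be converted before prediction. For cholesterol:
mg/dL x 0.026 ==> mmol/L (so a value entered in mmol/L is divided by 0.026).
For glucose the factor is different: mg/dL x 0.0555 ==> mmol/L (1 mmol/L is about 18 mg/dL).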
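The conversion from user input back to the units of the dataset could be sketched as follows. The function names and constants are hypothetical; the factors are approximate and follow from the molar masses (1 mmol/L of cholesterol is about 38.67 mg/dL, the inverse of 0.026, and 1 mmol/L of glucose is about 18.02 mg/dL).

```python
# Approximate conversion factors (mg/dL per mmol/L); illustrative constants.
CHOL_MGDL_PER_MMOLL = 38.67
GLUCOSE_MGDL_PER_MMOLL = 18.02


def chol_mmol_to_mgdl(value_mmol_l: float) -> float:
    """Convert a cholesterol reading from mmol/L to mg/dL."""
    return value_mmol_l * CHOL_MGDL_PER_MMOLL


def glucose_mmol_to_mgdl(value_mmol_l: float) -> float:
    """Convert a glucose reading from mmol/L to mg/dL."""
    return value_mmol_l * GLUCOSE_MGDL_PER_MMOLL


print(round(chol_mmol_to_mgdl(5.0), 2))     # 193.35
print(round(glucose_mmol_to_mgdl(6.0), 2))  # 108.12
```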

Also, the fbs feature (fasting blood sugar) received from the user
has to be encoded with a simple rule: if the sugar level is above 120 mg/dL, then 1, otherwise 0.
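The thresholding rule and the assembly of a feature row in the column order used for training might look like the sketch below. The helper names `encode_fbs` and `build_feature_row`, the input key `glucose_mg_dl` and the example values are hypothetical; the sketch assumes the glucose value has already been converted to mg/dL.

```python
# Column order used when the 7-feature model was trained.
FEATURE_ORDER = ['age', 'sex', 'trestbps', 'chol', 'fbs', 'thalach', 'exang']


def encode_fbs(glucose_mg_dl: float) -> int:
    """Return 1 if fasting blood sugar exceeds 120 mg/dL, otherwise 0."""
    return 1 if glucose_mg_dl > 120 else 0


def build_feature_row(user_input: dict) -> list:
    """Turn raw user input into a feature list in the training column order."""
    row = dict(user_input)  # copy so the caller's dict is not modified
    row['fbs'] = encode_fbs(row.pop('glucose_mg_dl'))
    return [row[name] for name in FEATURE_ORDER]


# Made-up example values for illustration only.
example_input = {'age': 57, 'sex': 1, 'trestbps': 130, 'chol': 236,
                 'glucose_mg_dl': 126, 'thalach': 174, 'exang': 0}
print(build_feature_row(example_input))  # [57, 1, 130, 236, 1, 174, 0]
```

The resulting list can be wrapped in another list and passed to the loaded model's `predict` method, which expects a 2D input.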