#**Fetal Health Classification**

Classify the health of a fetus as Normal, Suspect or Pathological using CTG data

###**Content**

Reduction of child mortality is reflected in several of the United Nations' Sustainable Development Goals and is a key indicator of human progress.
The UN expects that by 2030, countries end preventable deaths of newborns and children under 5 years of age, with all countries aiming to reduce under‑5 mortality to at least as low as 25 per 1,000 live births.

Parallel to the notion of child mortality is, of course, maternal mortality, which accounted for 295 000 deaths during and following pregnancy and childbirth (as of 2017). The vast majority of these deaths (94%) occurred in low-resource settings, and most could have been prevented.

In light of the above, Cardiotocograms (CTGs) are a simple and affordable option to assess fetal health, allowing healthcare professionals to take action to prevent child and maternal mortality. The equipment works by sending ultrasound pulses and reading their response, thus shedding light on fetal heart rate (FHR), fetal movements, uterine contractions and more.

###**Objective**

To classify each Cardiotocogram exam into one of 3 classes:

- Normal
- Suspect
- Pathological

###**Dataset link**
https://www.kaggle.com/andrewmvd/fetal-health-classification


###**Presentation link**
https://docs.google.com/presentation/d/1zj1to4mInQIUY-XmR9nPQdWq5o4-AYxM841PBrsoxrk

###**Other useful links**
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6822315/

#Importing the libraries

In [None]:
!pip install catboost

In [None]:
#Imports for handling, manipulation and visualization of the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
#import json

#Imports for classification model assessment
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, recall_score, precision_score, f1_score, roc_auc_score
from sklearn.metrics import log_loss
# SCORERS is no longer importable from sklearn.metrics in recent scikit-learn releases and is not used in this notebook

#Imports for model selection
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedShuffleSplit

#Imports for classification models
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier, Pool, cv

#Imports for the tree visualization
!pip install pydot
import pydot
from IPython.display import Image, display
from io import StringIO  # sklearn.externals.six is no longer shipped with scikit-learn
from sklearn.tree import export_graphviz


#Imports for the preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.utils import compute_sample_weight

%matplotlib inline

In [None]:
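# Newer Matplotlib releases renamed the 'seaborn' style to 'seaborn-v0_8'; use whichever is available
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')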

# Importing the dataset



Run the cells below to download the data for this classification task.

Note: after loading the data we perform an exploratory data analysis (EDA) before building the classifiers.

In [None]:
url = 'https://raw.githubusercontent.com/danihello/Fetal_health_classification/main/fetal_health.csv'

resp = requests.get(url)
text = resp.text

with open('fetal_health.csv', 'w') as fh:
    fh.write(text)
  
initial_df = pd.read_csv('fetal_health.csv')
#initial_df.sort_values('date', inplace=True, ignore_index=True)

In [None]:
initial_df.head(5)

In [None]:
initial_df.info()

###***Dataset Structure***

- 'baseline value'- FHR baseline (beats per minute)
- 'accelerations'- Number of accelerations per second
- 'fetal_movement'- Number of fetal movements per second
- 'uterine_contractions'- Number of uterine contractions per second
- 'light_decelerations'- Number of light decelerations per second
- 'severe_decelerations'- Number of severe decelerations per second
- 'prolongued_decelerations'- Number of prolonged decelerations per second
- 'abnormal_short_term_variability'- Percentage of time with abnormal short term variability
- 'mean_value_of_short_term_variability'- Mean value of short term variability
- 'percentage_of_time_with_abnormal_long_term_variability'- Percentage of time with abnormal long term variability
- 'mean_value_of_long_term_variability'- Mean value of long term variability
- 'histogram_width'- Width of FHR histogram
- 'histogram_min'- Minimum (low frequency) of FHR histogram
- 'histogram_max'- Maximum (high frequency) of FHR histogram
- 'histogram_number_of_peaks'- Number of histogram peaks
- 'histogram_number_of_zeroes'- Number of histogram zeros
- 'histogram_mode'- Histogram mode
- 'histogram_mean'- Histogram mean
- 'histogram_median'- Histogram median
- 'histogram_variance'- Histogram variance
- 'histogram_tendency'- Histogram tendency

Target

- 'fetal_health' Tagged as 1 (Normal), 2 (Suspect) and 3 (Pathological)

In [None]:
initial_df.nunique()

# Data Cleaning

## Missing values

No missing values are found.

In [None]:
len(initial_df.loc[initial_df.isnull().sum(axis=1)>0])

## Duplicated values

There are 13 duplicates in the data set. Since this number is immaterial relative to the size of the data set, we decided to delete them.

In [None]:
print('There are ',len(initial_df.loc[initial_df.duplicated(keep='first')>0]), 'duplicates in the dataset', '\n')

print(f'Before duplicates removal: {initial_df.shape}')
initial_df = initial_df.drop_duplicates(keep='first')
print(f'After duplicates removal: {initial_df.shape}')


#Exploratory Data Analysis (EDA)

In [None]:
sns.set(style="whitegrid", font_scale=1)
h = initial_df.hist(bins=25,figsize=(20,20),xlabelsize='10',ylabelsize='10',xrot=-15)


In [None]:
initial_df.describe().T

In [None]:
plt.figure(figsize=(15,15))
sns.boxplot(data=initial_df)
plt.title(label='Before standardize', size=22)
plt.show()

##Target class analysis

In [None]:
class_dict = {1:'Normal', 2:'Suspect', 3:'Pathological'}
initial_df['class_label'] = initial_df.fetal_health.map(class_dict)
# Classes ordered from most to least frequent; reused by the later plots
class_order = initial_df.class_label.value_counts().index
plt.figure(figsize=(18,5))
plt.subplot(1,2,1)
sns.countplot(data=initial_df, x='class_label', order=class_order)
plt.xlabel('Fetal health')
plt.subplot(1,2,2)
# Counting class_label directly keeps the pie slices aligned with class_order
initial_df['class_label'].value_counts().plot(kind='pie', autopct='%.2f%%', explode=[0,0.1,0.1],
                                       startangle=90,
                                      labels = class_order)
plt.suptitle('Distribution of the target variable', fontsize=20)
plt.show()

The classes are imbalanced, so we might need an imbalanced-data technique later; for now we first run regular models.

**Kde histograms**

In [None]:
kde_features = ['baseline value','accelerations','fetal_movement','uterine_contractions','light_decelerations','severe_decelerations','prolongued_decelerations',
            'abnormal_short_term_variability','mean_value_of_short_term_variability','percentage_of_time_with_abnormal_long_term_variability','mean_value_of_long_term_variability',
            'histogram_width','histogram_min','histogram_max','histogram_number_of_peaks','histogram_number_of_zeroes','histogram_mode','histogram_mean','histogram_median',
            'histogram_variance','histogram_tendency']

plt.subplots(21,1,figsize=(10,150))
for i,feature in enumerate(kde_features,1):
    plt.subplot(21,1,i)
    sns.kdeplot(x=initial_df[feature],hue=initial_df['class_label'], hue_order=class_order, fill=True);
    plt.title(f'{feature} kde plot', fontsize=16)
    # plt.legend(title='fetal health',labels=['Normal', 'Suspect', 'Pathological'])


## Pearson Correlation Matrix

We compute the Pearson correlation matrix to better understand the linear relationships between the features.

In [None]:
features = ['baseline value','accelerations','fetal_movement','uterine_contractions','light_decelerations','severe_decelerations','prolongued_decelerations',
            'abnormal_short_term_variability','mean_value_of_short_term_variability','percentage_of_time_with_abnormal_long_term_variability','mean_value_of_long_term_variability',
            'histogram_width','histogram_min','histogram_max','histogram_number_of_peaks','histogram_number_of_zeroes','histogram_mode','histogram_mean','histogram_median',
            'histogram_variance','histogram_tendency','fetal_health']

mask = np.zeros_like(initial_df[features].corr(), dtype=bool)  # np.bool was removed from NumPy; the builtin bool works
mask[np.triu_indices_from(mask)] = True 

f, ax = plt.subplots(figsize=(16, 12))
plt.title('Pearson Correlation Matrix',fontsize=25)

sns.heatmap(initial_df[features].corr(),linewidths=0.25,vmax=0.7,square=True,cmap="Blues", #"BuGn_r" to reverse 
            linecolor='w',annot=True,annot_kws={"size":8},mask=mask,cbar_kws={"shrink": .9});
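
To read the strongest relationships off the heatmap as numbers, the upper triangle of the correlation matrix can be flattened into a ranked list. The following is a minimal sketch on synthetic data; `toy_df` and `top_correlated_pairs` are hypothetical names that do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=3):
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears once and the diagonal is dropped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=np.abs, ascending=False).head(n)

rng = np.random.default_rng(22)
a = rng.normal(size=200)
toy_df = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=200),   # almost a copy of 'a'
    'c': rng.normal(size=200),                  # independent noise
})
top = top_correlated_pairs(toy_df, n=2)
print(top)
```

The same helper could be applied to `initial_df[features]` to list candidate pairs for scatter plots.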

## Scatter plots of the most correlated features

Normal and pathological cases seem to be well separated from each other. Suspect cases look harder to distinguish.

In [None]:
new_names_initial_df = initial_df.rename(columns = {'abnormal_short_term_variability':'ab_short_var',
                                                    'percentage_of_time_with_abnormal_long_term_variability':'%_ab_long_var',
                                                    'mean_value_of_short_term_variability':'mean_short_var'   
})

cols_to_use = [
               'prolongued_decelerations',
            'ab_short_var',
            '%_ab_long_var',
            'accelerations',
            'histogram_mode',
            'histogram_mean',
            'class_label'
            ]            
# cols_to_use=['mean_short_var', 'uterine_contractions', 'ab_short_var','histogram_mean', 'histogram_median','class_label'
             
# ]

sns.pairplot(new_names_initial_df[cols_to_use], hue='class_label', hue_order=class_order)
#plt.title('Pair plots', fontsize=20)
plt.show()

In [None]:
sns.scatterplot(data=initial_df, x='histogram_mean', y='abnormal_short_term_variability', hue=initial_df['class_label'], hue_order=class_order)
plt.show()

In [None]:
sns.scatterplot(data=initial_df, x='abnormal_short_term_variability', y='baseline value', hue=initial_df['class_label'], hue_order=class_order)
plt.show()

In [None]:
sns.scatterplot(data=initial_df, x='baseline value', y='mean_value_of_short_term_variability', hue=initial_df['class_label'], hue_order=class_order)
plt.show()

In [None]:
sns.scatterplot(data=initial_df, x='histogram_mean', y='mean_value_of_short_term_variability', hue=initial_df['class_label'], hue_order=class_order)
plt.show()

In [None]:
sns.scatterplot(data=initial_df, x='histogram_mean', y='histogram_median', hue=initial_df['class_label'], hue_order=class_order)
plt.show()

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(data=initial_df, x='histogram_mean', y='histogram_min', hue=initial_df['class_label'], hue_order=class_order)
plt.show()

In [None]:
# kde_features = ['baseline value','accelerations','fetal_movement','uterine_contractions','light_decelerations','severe_decelerations','prolongued_decelerations',
#             'abnormal_short_term_variability','mean_value_of_short_term_variability','percentage_of_time_with_abnormal_long_term_variability','mean_value_of_long_term_variability',
#             'histogram_width','histogram_min','histogram_max','histogram_number_of_peaks','histogram_number_of_zeroes','histogram_mode','histogram_mean','histogram_median',
#             'histogram_variance','histogram_tendency']
# for feature_1 in kde_features:
#             sns.scatterplot(data=initial_df, x='abnormal_short_term_variability', y=feature_1, hue=initial_df['class_label'])
#             plt.show()

# Splitting the data

We used **stratify=y** to keep the class proportions the same in the train and test sets.

In [None]:
test_size=0.25
X = initial_df.drop(columns=['class_label', 'fetal_health'])
y = initial_df.class_label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=22, stratify=y)
X_train.shape, X_test.shape, y_train.shape, y_test.shape


In [None]:
print(y.value_counts()/y.shape[0])
print()
print(y_train.value_counts()/y_train.shape[0])
print()
print(y_test.value_counts()/y_test.shape[0])
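
The effect of stratification is easier to see on a small synthetic example. This sketch uses made-up labels with a 70/20/10 class mix (an illustrative assumption, not the real distribution) and a hypothetical `proportions` helper.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a 70/20/10 class mix (illustrative values only)
y_toy = np.array(['Normal'] * 280 + ['Suspect'] * 80 + ['Pathological'] * 40)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25,
                                    random_state=22, stratify=y_toy)

def proportions(labels):
    """Share of each class in an array of labels."""
    values, counts = np.unique(labels, return_counts=True)
    return {str(v): c / counts.sum() for v, c in zip(values, counts)}

print(proportions(y_tr))
print(proportions(y_te))
```

Both printed dictionaries should show the same class shares as the full label array, which is what the three `value_counts` printouts above check for the real data.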

#Feature Engineering

After reviewing the data, we concluded that scaling is the only preprocessing the models need.
Scaling helps the distance-based and linear models, and it does no harm to the tree-based models, which are insensitive to it.

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
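
The scaler is fitted on the training rows only, and the test rows are transformed with the training mean and standard deviation. A minimal sketch on synthetic numbers (the arrays and the name `toy_scaler` are assumptions made for illustration) shows what that implies:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(22)
train = rng.normal(loc=130, scale=10, size=(300, 2))  # synthetic values on a heart-rate-like scale
test = rng.normal(loc=130, scale=10, size=(100, 2))

toy_scaler = StandardScaler().fit(train)      # statistics come from the training rows only
train_scaled = toy_scaler.transform(train)
test_scaled = toy_scaler.transform(test)      # test rows reuse the training mean and std

print(train_scaled.mean(axis=0).round(3), train_scaled.std(axis=0).round(3))
print(test_scaled.mean(axis=0).round(3), test_scaled.std(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately standardized. That is expected: no information from the test set is used to choose the scaling.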

In [None]:
plt.figure(figsize=(16,10))
sns.boxplot(data=X_train)
plt.title(label='After standardize', size=22)
plt.show()

# sns.boxplot(pd.DataFrame(X_train, columns=X.columns))

# **Classification**

##**Tools**

###Report function

In [None]:
def report(clf, X, y):
    acc = accuracy_score(y_true=y, 
                         y_pred=clf.predict(X))
    cm = pd.DataFrame(confusion_matrix(y_true=y, 
                                       y_pred=clf.predict(X)), 
                      index=clf.classes_, 
                      columns=clf.classes_)
    rep = classification_report(y_true=y, 
                                y_pred=clf.predict(X))
    return 'accuracy: {:.3f}\n\n{}\n\n{}'.format(acc, cm, rep)

###Cross Validation Function

In [None]:
def cv_score(classifier, scoring_measure, cv, X, y):
    scores = cross_val_score(classifier, X, y, cv=cv, scoring=scoring_measure)
    print(f'Mean {scoring_measure}: {np.mean(scores):.3f} ({np.std(scores):.3f})')

###Cross Validation measurement

We use StratifiedShuffleSplit so that every split keeps the class proportions of the imbalanced data, which guarantees that each class is present in both the training and the validation part.
We repeat the split 10 times to get more robust results.

In [None]:
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, train_size=0.8, random_state=22)
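
As a quick illustration of what this splitter does, the sketch below counts the classes in each validation part for a toy label array with an 80/15/5 mix (made-up numbers; `y_toy`, `splitter` and `val_counts` are hypothetical names).

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

y_toy = np.array([0] * 80 + [1] * 15 + [2] * 5)
X_toy = np.zeros((len(y_toy), 1))             # features are irrelevant for the split itself

splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, train_size=0.8, random_state=22)
val_counts = np.array([np.bincount(y_toy[val_idx], minlength=3)
                       for _, val_idx in splitter.split(X_toy, y_toy)])
print(val_counts)                             # one row per split, one column per class
```

Every row holds the same class counts, so even the rarest class is always represented. Note that, unlike k-fold, the validation parts of different splits are drawn independently and may overlap.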

###Measurement

We want to minimize both false negatives (saying a fetus is healthy when it is actually sick)
and false positives (saying a fetus is sick when it is healthy), so we measure our models with the F1 metric.

Macro F1 weights every class equally, so the smallest class, which is the one we are most concerned about, counts as much as the largest. This makes macro F1 harder to maximize than weighted F1, which is why we chose it.

In [None]:
measure = 'f1_macro'
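
The difference between the two averages shows up clearly on a toy prediction. The labels below are invented for illustration: 90 Normal and 10 Pathological cases, with a model that misses half of the Pathological ones.

```python
from sklearn.metrics import f1_score

# 90 Normal, 10 Pathological; the toy model finds only 5 of the 10 Pathological cases
y_true = ['Normal'] * 90 + ['Pathological'] * 10
y_pred = ['Normal'] * 90 + ['Normal'] * 5 + ['Pathological'] * 5

per_class = f1_score(y_true, y_pred, average=None, labels=['Normal', 'Pathological'])
macro = f1_score(y_true, y_pred, average='macro')        # plain mean of the per-class scores
weighted = f1_score(y_true, y_pred, average='weighted')  # mean weighted by class frequency
print(per_class, macro, weighted)
```

The weak score on the small class pulls the macro average down much more than the weighted one, so macro F1 is the stricter measure when the rare class matters most.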

##**DecisionTree Classifier**

In [None]:
clf = DecisionTreeClassifier(class_weight='balanced',random_state=22)

print('Cross Validation:')
print()
scores = cross_val_score(clf, X_train, y_train, cv=cv, scoring=measure)
print(f'Mean {measure}: {np.mean(scores):.3f} ({np.std(scores):.3f})')

print()
print()
print('Classification Report for X train:')
print()
clf.fit(X_train, y_train)
print(classification_report(y_train, clf.predict(X_train)))

print()
print()
print('Classification Report for X test:')
print()
print(classification_report(y_test, clf.predict(X_test)))

##**Random Forest Classifier**

In [None]:
clf_rf = RandomForestClassifier(class_weight='balanced', random_state=22)

measure = 'f1_macro'
scores = cross_val_score(clf_rf, X_train, y_train, cv=cv, scoring=measure)
print(f'Mean {measure}: {np.mean(scores):.3f} ({np.std(scores):.3f})')


In [None]:
clf_rf.fit(X_train, y_train)

print(classification_report(y_train, clf_rf.predict(X_train)))

In [None]:
print(classification_report(y_test, clf_rf.predict(X_test)))

Random Forest grid search

In [None]:
parameters = { 
    'n_estimators': [100,300,600],
    'max_features': ['sqrt', 'log2'],
    'max_depth' : [4,8,12,14,16],
    'criterion' :['entropy', 'gini'],
    'class_weight' :['balanced']
}

CV_rfc = GridSearchCV(estimator=RandomForestClassifier(), param_grid=parameters, cv=cv, scoring=measure, verbose=2)
# CV_rfc.fit(X_train, y_train)

In [None]:
best_params = {'class_weight': 'balanced',
 'criterion': 'entropy',
 'max_depth': 12,
 'max_features': 'sqrt',
 'n_estimators': 300}

In [None]:
rf_clsf_best = RandomForestClassifier(**best_params, random_state=22)

In [None]:
cv_score(rf_clsf_best, measure, cv, X_train, y_train)

In [None]:
rf_clsf_best.fit(X_train, y_train)
print(classification_report(y_train, rf_clsf_best.predict(X_train)))

In [None]:
print(classification_report(y_test, rf_clsf_best.predict(X_test)))

##**XGBoost**

In [None]:
clf_xgb = XGBClassifier(random_state=22)

In [None]:
cv_score(clf_xgb, measure, cv, X_train, y_train)

In [None]:
# TODO: delete later
# kuku = cv_score(clf_xgb, measure, cv, X_train, y_train)
# type(kuku)
kuku = cross_val_score(clf_xgb, X_train, y_train, cv= StratifiedShuffleSplit(n_splits=10, test_size=0.2, train_size=0.8, random_state=22), scoring='f1_macro')
kuku_mean = np.mean(kuku)
kuku_mean
kuku_std = np.std(kuku)
kuku_std
d = {'cv_mean':kuku_mean,'cv_std':kuku_std}
dd = pd.DataFrame(d.items()).transpose()
dd

In [None]:
clf_xgb.fit(X_train, y_train)
print(classification_report(y_train, clf_xgb.predict(X_train)))

In [None]:
print(classification_report(y_test, clf_xgb.predict(X_test)))

Class Weights

In [None]:
classes_weights = compute_sample_weight(
    class_weight='balanced',
    y=y_train
)

clf_xgb_weight = XGBClassifier(random_state=22)
clf_xgb_weight.fit(X_train, y_train,sample_weight=classes_weights)

# pd.Series(classes_weights).head(60)
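
For reference, the 'balanced' option assigns each sample the weight n_samples / (n_classes * class_count). The sketch below checks this on a tiny made-up label array; `y_toy`, `weights` and `manual` are hypothetical names.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

y_toy = np.array(['Normal'] * 6 + ['Suspect'] * 3 + ['Pathological'] * 1)
weights = compute_sample_weight(class_weight='balanced', y=y_toy)

# 'balanced' weight per class = n_samples / (n_classes * class_count)
classes, counts = np.unique(y_toy, return_counts=True)
manual = {str(c): len(y_toy) / (len(classes) * n) for c, n in zip(classes, counts)}
print(manual)
print(weights)
```

With these weights every class contributes the same total weight to the loss, regardless of how many samples it has.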

In [None]:
print(classification_report(y_train, clf_xgb_weight.predict(X_train)))

In [None]:
print(classification_report(y_test, clf_xgb_weight.predict(X_test)))

Adding Class weights did not yield a better result

XGBoost Grid Search

In [None]:
cv = StratifiedShuffleSplit(n_splits=5, random_state=22, test_size=0.2,
            train_size=0.8)

In [None]:
def gridsearch_cv_score(classifier, cv, X, y, params, scoring):
    search = GridSearchCV(estimator=classifier, param_grid=params, cv=cv, scoring=scoring, verbose=2)
    search.fit(X, y)
    return search.best_params_

In [None]:
params = {'colsample_bytree': [0.6, 0.8, 1.0],
 'learning_rate': [0.1, 0.3, 0.7],
 'max_depth': [3, 4, 5, 7],
 'min_child_weight': [1, 5, 10],
 'n_estimators': [10],
 'subsample': [0.6, 0.8, 1.0],
 'gamma': [0.5, 1, 1.5, 2, 5]
 }



In [None]:
# gridsearch_cv_score(XGBClassifier(random_state=22), cv, X_train, y_train, params=params,scoring=measure)

In [None]:
# CV_xgboost.best_params_
best_params ={'colsample_bytree': 0.7,
 'learning_rate': 0.1,
 'max_depth': 10,
 'min_child_weight': 1,
 'n_estimators': 100,
 'subsample': 0.7}

In [None]:
new_best_params = {'colsample_bytree': 1.0,
 'gamma': 0.5,
 'learning_rate': 0.3,
 'max_depth': 7,
 'min_child_weight': 1,
 'n_estimators': 100,
 'subsample': 0.8}

In [None]:
xg_clsf_best = XGBClassifier(**new_best_params, random_state=22)

In [None]:
cv_score(xg_clsf_best, measure, cv, X_train, y_train)

In [None]:
xg_clsf_best.fit(X_train, y_train)
print(classification_report(y_train, xg_clsf_best.predict(X_train)))

In [None]:
print(classification_report(y_test, xg_clsf_best.predict(X_test)))

The tuned XGBoost test score is close to the cross-validated mean score, which suggests that the model generalizes well and would probably yield similar results on unseen data.

In [None]:
f1_result = pd.Series(f1_score(y_test, clf_xgb.predict(X_test),average=None), index=['Normal', 'Pathological', 'Suspect'],name='F1')
recall_result = pd.Series(recall_score(y_test, clf_xgb.predict(X_test),average=None), index=['Normal', 'Pathological', 'Suspect'], name='Recall')
precision_result = pd.Series(precision_score(y_test, clf_xgb.predict(X_test),average=None), index=['Normal', 'Pathological', 'Suspect'], name='Precision')
df_metric = pd.DataFrame([f1_result.T, recall_result.T, precision_result.T])


In [None]:
blue, orange, green = list(sns.color_palette())[:3]


In [None]:
df_metric

In [None]:
# The metrics were computed with average=None, so each bar is a per-class score
df_metric.plot(kind='bar', figsize=(16,12), color=[blue, green, orange])
plt.yticks(np.arange(0,1.4,0.1))
plt.title('XGBoost per-class metrics', fontsize=16)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.legend(fontsize=16)
plt.ylim(0.7,1)  # zoom in on the range where the scores differ
plt.show()

##**Cat Boost**

In [None]:
clf_cat_boost = CatBoostClassifier(random_state=22)

In [None]:
scores = cross_val_score(clf_cat_boost, X_train, y_train, cv=5, scoring=measure)
print(f'Mean {measure}: {np.mean(scores):.3f} ({np.std(scores):.3f})')

In [None]:
clf_cat_boost.fit(X_train, y_train)
print(classification_report(y_train, clf_cat_boost.predict(X_train)))

In [None]:
print(classification_report(y_test, clf_cat_boost.predict(X_test)))

##LogisticRegression

In [None]:
clf_lr = LogisticRegression(random_state=22, max_iter=1000, class_weight='balanced')

In [None]:
cv_score(clf_lr, measure, cv, X_train, y_train)

In [None]:
clf_lr.fit(X_train, y_train)

In [None]:
print(classification_report(y_train, clf_lr.predict(X_train)))

In [None]:
print(classification_report(y_test, clf_lr.predict(X_test)))

Grid Search LogisticRegression

In [None]:
grid_param = {'penalty' : ['l1', 'l2'],
    'C' : np.logspace(-2, 2, 20),
    'class_weight':['balanced',None],
    'solver':['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'multi_class':['auto', 'ovr']}

GSCV_lr = GridSearchCV(estimator=LogisticRegression(random_state=22, max_iter=1000), param_grid=grid_param, cv=cv, scoring=measure, verbose=2)
# GSCV_lr.fit(X_train, y_train)  # tune on the training set only; the test set stays held out
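
A single dictionary grid crosses every solver with every penalty, including pairs that LogisticRegression does not support (for example `lbfgs` with `l1`). One way to avoid those fits is to pass a list of dictionaries, one per compatible group. This is a sketch with a shortened `C` range; `lr_grid` and `combos` are hypothetical names, and `ParameterGrid` is used here only to count the combinations without fitting anything.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

C_values = np.logspace(-2, 2, 5)
# liblinear and saga accept both l1 and l2; newton-cg, lbfgs and sag accept l2 only
lr_grid = [
    {'solver': ['liblinear', 'saga'], 'penalty': ['l1', 'l2'], 'C': C_values},
    {'solver': ['newton-cg', 'lbfgs', 'sag'], 'penalty': ['l2'], 'C': C_values},
]
combos = list(ParameterGrid(lr_grid))
print(len(combos))
```

The same list can be given to `GridSearchCV` as `param_grid`, so that only valid solver and penalty pairs are evaluated.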

In [None]:
lr_best_params = {'C': 5.455594781168514,
 'class_weight': 'balanced',
 'penalty': 'l1',
 'solver': 'liblinear',
 'multi_class':'auto'
}

In [None]:
clf_lr_best = LogisticRegression(random_state=22, max_iter=1000,**lr_best_params,)
cv_score(clf_lr_best, measure, cv, X_train, y_train)

In [None]:
clf_lr_best.fit(X_train, y_train)
print(classification_report(y_train, clf_lr_best.predict(X_train)))
print()
print()
print(classification_report(y_test, clf_lr_best.predict(X_test)))

##KNeighborsClassifier

In [None]:
clf_knn = KNeighborsClassifier(n_neighbors=5)

In [None]:
cv_score(clf_knn, measure, cv, X_train, y_train)

In [None]:
clf_knn.fit(X_train, y_train)
print(classification_report(y_train, clf_knn.predict(X_train)))

In [None]:
print(classification_report(y_test, clf_knn.predict(X_test)))

Grid Search KNN

In [None]:
test_param = {'n_neighbors':[5, 10, 15, 20],
              'p':[1, 2],
              'algorithm':['auto', 'ball_tree', 'kd_tree']}

GSCV_knn = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=test_param, cv=cv, scoring=measure, verbose=2)
# GSCV_knn.fit(X_train, y_train)


In [None]:
knn_best_params =  {'algorithm': 'auto', 'n_neighbors': 5, 'p': 1}

In [None]:
clf_knn_best = KNeighborsClassifier(**knn_best_params)

cv_score(clf_knn_best, measure, cv, X_train, y_train)

In [None]:
clf_knn_best.fit(X_train, y_train)
print(classification_report(y_train, clf_knn_best.predict(X_train)))

In [None]:
print(classification_report(y_test, clf_knn_best.predict(X_test)))

##SVM

In [None]:
svc_class = SVC(class_weight='balanced', random_state=22,)

In [None]:
cv_score(svc_class, measure, 5, X_train, y_train)

In [None]:
test_param = {'C':[0.01, 0.1, 1, 10, 50, 100],
              'kernel':['rbf', 'linear', 'poly'],
              'gamma':['scale', 'auto']}

GSCV_svc = GridSearchCV(estimator=svc_class, param_grid=test_param, cv=5, scoring=measure, verbose=2)
# GSCV_svc.fit(X_train, y_train)

In [None]:
# GSCV_svc.best_params_

In [None]:
svc_params = {'C': 50, 'gamma': 'scale', 'kernel': 'poly'}

In [None]:
clf_svc_best = SVC(**svc_params)

cv_score(clf_svc_best, measure, 5, X_train, y_train)

In [None]:
clf_svc_best.fit(X_train, y_train)
print(classification_report(y_train, clf_svc_best.predict(X_train)))

In [None]:
print(classification_report(y_test, clf_svc_best.predict(X_test)))

##Feature Importances

TODO: a check of these importances still needs to be added.

In [None]:
# classifier.feature_importances_
def plot_feture_importance(classifier, columns, title='title', n_largest=None):
    if n_largest is None:
        n_largest = len(classifier.feature_importances_)
    feat_importances = pd.Series(classifier.feature_importances_, index=columns).nlargest(n_largest)
    result = feat_importances.sort_values().plot(kind='barh')
    plt.title(title, fontsize=16)
    plt.yticks(fontsize=14)
    return result

In [None]:
plt.figure(figsize=(10,12))
plot_feture_importance(xg_clsf_best, X.columns, n_largest=20, title='XG-boost feature importance')
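
One possible cross-check for impurity- or gain-based importances is permutation importance on held-out data, which measures how much a score drops when a single feature is shuffled. The sketch below is an illustration on synthetic data with a random forest, not a result from this notebook; `X_toy`, `y_toy` and `forest` are hypothetical names.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(22)
X_toy = rng.normal(size=(400, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)          # only feature 0 carries signal

forest = RandomForestClassifier(n_estimators=50, random_state=22).fit(X_toy[:300], y_toy[:300])
# Shuffle each feature 10 times on the held-out rows and record the drop in macro F1
result = permutation_importance(forest, X_toy[300:], y_toy[300:], scoring='f1_macro',
                                n_repeats=10, random_state=22)
print(result.importances_mean)
```

Applied to a fitted model with the scaled `X_test` and `y_test`, the same call would give a ranking that can be compared against the bar chart above.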


##**SMOTE**

LINK: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/

In [None]:
from imblearn.over_sampling import SMOTE, SVMSMOTE

In [None]:
X_train.shape[0]

In [None]:
y.value_counts()/y.shape[0]

In [None]:
over_sample = SMOTE(random_state=22)  # fixed seed so the resampled set is reproducible

In [None]:
X_resample, y_resample = over_sample.fit_resample(X_train, y_train)

y_resample.shape[0]

In [None]:
pd.Series(y_resample).value_counts()/y_resample.shape[0]

In [None]:
X_resample.shape

In [None]:
clf_xgb_resample = XGBClassifier(random_state=22)

In [None]:
cv_score(clf_xgb, 'f1_macro', cv, X_train, y_train)

In [None]:
print(classification_report(y_test, clf_xgb.predict(X_test)))

In [None]:
cv_score(clf_xgb_resample, 'f1_macro', cv, X_resample, y_resample)

In [None]:
def roc_auc_score_multiclass(actual_class, pred_class, average = "macro"):

  #creating a set of all the unique classes using the actual class list
  unique_class = set(actual_class)
  roc_auc_dict = {}
  for per_class in unique_class:
    #creating a list of all the classes except the current class 
    other_class = [x for x in unique_class if x != per_class]

    #marking the current class as 1 and all other classes as 0
    new_actual_class = [0 if x in other_class else 1 for x in actual_class]
    new_pred_class = [0 if x in other_class else 1 for x in pred_class]

    #using the sklearn metrics method to calculate the roc_auc_score
    roc_auc = roc_auc_score(new_actual_class, new_pred_class, average = average)
    roc_auc_dict[per_class] = roc_auc

  return roc_auc_dict
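
The helper above scores hard class predictions, which gives only one operating point per class. ROC AUC is usually computed from predicted probabilities, and `roc_auc_score` supports the one-vs-rest multiclass case directly. The following sketch uses synthetic data and a logistic regression as stand-ins (assumptions for illustration only).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_toy, y_toy = make_classification(n_samples=600, n_features=8, n_informative=5,
                                   n_classes=3, random_state=22)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25,
                                          random_state=22, stratify=y_toy)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)              # one column per class, ordered as model.classes_

# One-vs-rest AUC per class, averaged with equal class weight
auc_from_proba = roc_auc_score(y_te, proba, multi_class='ovr', average='macro')
print(f'{auc_from_proba:.3f}')
```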

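SMOTE Right Way

Resampling the whole training set before cross-validation lets synthetic samples derived from validation rows leak into the training folds. Putting SMOTE inside an imblearn pipeline resamples only the training part of each split.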

In [None]:
from imblearn.over_sampling import SMOTE, SVMSMOTE
from imblearn.pipeline import Pipeline as imbpipeline
import warnings
warnings.filterwarnings("ignore")

In [None]:
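# Keep the unscaled split under separate names so the scaled X_train/X_test used by the other models stay intact
X_train_raw, X_test_raw, y_train_raw, y_test_raw = train_test_split(X, y, test_size=test_size, random_state=22, stratify=y)


In [None]:
# A fresh XGBClassifier is used so that fitting the pipeline does not refit xg_clsf_best
pipeline = imbpipeline(steps = [['smote', SMOTE(random_state=22)],
                                ['scaler', StandardScaler()],
                                ['classifier', XGBClassifier(**new_best_params, random_state=22)]])

In [None]:
cv_score(pipeline, measure, cv, X_train_raw, y_train_raw)

In [None]:
pipeline.fit(X_train_raw, y_train_raw)
print(classification_report(y_test_raw, pipeline.predict(X_test_raw)))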
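
For intuition about what the `smote` step adds to the training rows, here is a simplified reimplementation of SMOTE's core idea: each synthetic row lies on the segment between a minority row and one of its nearest minority neighbours. This is a sketch, not imblearn's implementation, and `smote_like_oversample` is a hypothetical helper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_min, n_new, k=5, seed=22):
    """Create n_new synthetic rows by interpolating between minority rows and their neighbours."""
    rng = np.random.default_rng(seed)
    # k + 1 because each row's nearest neighbour is the row itself
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]

    base = rng.integers(0, len(X_min), size=n_new)              # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one random neighbour for each
    gap = rng.random((n_new, 1))                                # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

rng = np.random.default_rng(0)
X_minority = rng.normal(size=(20, 2))
X_synth = smote_like_oversample(X_minority, n_new=30)
print(X_synth.shape)
```

Because every synthetic row depends on its neighbours, generating them before the cross-validation split would mix information from validation rows into the training folds.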

Maybe we can test grid search on smote and xgboost

In [None]:
classifiers = [clf,rf_clsf_best,xg_clsf_best, clf_cat_boost,clf_svc_best,clf_lr,clf_knn]


In [None]:
# type(classification_report(y_test, clf_svc_best.predict(X_test)))
# print(classification_report(y_test, clf_svc_best.predict(X_test)))
a = classification_report(y_test, clf_svc_best.predict(X_test), output_dict=True)['Pathological']
for k,v in a.items():
  print(k)

# t = pd.DataFrame(data = classification_report(y_test, clf_svc_best.predict(X_test), output_dict=True)['Pathological'])

In [None]:
# type(a.items())
list(a.items())[:3]

In [None]:

b = pd.DataFrame(data = list(a.items())[:3] )
b.iloc[:,0]
b = b.transpose()
b.columns = b.iloc[0]
b.drop(b.index[0])
b


In [None]:
b.drop(b.index[0], inplace=True)

In [None]:
b

In [None]:
b.index = ['clf_svc_best']

In [None]:
b

In [None]:
# Compare the tuned classifiers: macro-averaged test metrics plus cross-validated macro F1
classifiers = [clf_svc_best, rf_clsf_best, xg_clsf_best, clf_cat_boost, clf_lr_best, clf_knn_best]
classifiers_names = ['clf_svc_best', 'rf_clsf_best', 'xg_clsf_best', 'clf_cat_boost', 'clf_lr_best', 'clf_knn_best']
rows = []
for name, classifier in zip(classifiers_names, classifiers):
    # precision, recall and f1-score from the 'macro avg' row of the test-set report
    macro = classification_report(y_test, classifier.predict(X_test), output_dict=True)['macro avg']
    row = {k: macro[k] for k in ('precision', 'recall', 'f1-score')}

    # cross_val_score clones the estimator, so the fitted classifier is left untouched
    cv_scores = cross_val_score(classifier, X_train, y_train,
                                cv=StratifiedShuffleSplit(n_splits=10, test_size=0.2, train_size=0.8, random_state=22),
                                scoring='f1_macro')
    row['cv_mean'] = np.mean(cv_scores)
    row['cv_std'] = np.std(cv_scores)
    rows.append(pd.Series(row, name=name))

dff = pd.DataFrame(rows).sort_values(by='cv_mean', ascending=False)
dff