#### 1.1 Data Dictionary <a id=2></a>
`age` - Age of the patient

`sex` - Sex of the patient

`cp` - Chest pain type ~ 0 = Typical Angina, 1 = Atypical Angina, 2 = Non-anginal Pain, 3 = Asymptomatic

`trtbps` - Resting blood pressure (in mm Hg)

`chol` - Cholesterol in mg/dl fetched via BMI sensor

`fbs` - (fasting blood sugar > 120 mg/dl) ~ 1 = True, 0 = False

`restecg` - Resting electrocardiographic results ~ 0 = Normal, 1 = ST-T wave abnormality, 2 = Left ventricular hypertrophy

`thalachh` - Maximum heart rate achieved

`oldpeak` - ST depression induced by exercise relative to rest

`slp` - Slope of the peak exercise ST segment

`caa` - Number of major vessels

`thall` - Thallium Stress Test result ~ (0,3)

`exng` - Exercise induced angina ~ 1 = Yes, 0 = No

`output` - Target variable
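
The coded columns above can be turned back into readable labels with plain lookup tables. The sketch below is only an illustration: the names `CODE_LABELS` and `decode` are hypothetical and not part of the notebook, and only the `cp` and `exng` codes listed in the dictionary are included.

```python
# Lookup tables built from the data dictionary above.
# CODE_LABELS and decode are hypothetical helper names, not from the notebook.
CODE_LABELS = {
    "cp": {
        0: "Typical Angina",
        1: "Atypical Angina",
        2: "Non-anginal Pain",
        3: "Asymptomatic",
    },
    "exng": {1: "Yes", 0: "No"},
}


def decode(column, code):
    """Return the label for a coded value, or the value itself if no mapping exists."""
    return CODE_LABELS.get(column, {}).get(code, code)


print(decode("cp", 2))    # coded column: label is looked up
print(decode("age", 54))  # uncoded column: value is returned unchanged
```

With pandas, the same tables could be passed to `Series.map` to label a whole column at once.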

#### 2.1 Packages <a id=5></a>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

#### 2.2 Data <a id=6></a>

In [4]:
df = pd.read_csv(r"C:\Users\Wesley Ribeiro\OneDrive\Documentos\previne-bem\dados\heart.csv")

In [5]:
df.columns.values

array(['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg', 'thalachh',
       'exng', 'oldpeak', 'slp', 'caa', 'thall', 'output'], dtype=object)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trtbps    303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalachh  303 non-null    int64  
 8   exng      303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slp       303 non-null    int64  
 11  caa       303 non-null    int64  
 12  thall     303 non-null    int64  
 13  output    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


In [7]:
df.describe()

statistic,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.366337,0.683168,0.966997,131.623762,246.264026,0.148515,0.528053,149.646865,0.326733,1.039604,1.39934,0.729373,2.313531,0.544554
std,9.082101,0.466011,1.032052,17.538143,51.830751,0.356198,0.52586,22.905161,0.469794,1.161075,0.616226,1.022606,0.612277,0.498835
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47.5,0.0,0.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,2.0,0.0
50%,55.0,1.0,1.0,130.0,240.0,0.0,1.0,153.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,274.5,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0


#### 2.3 Understanding Data <a id=7></a>

##### 2.3.1 The shape of the data

In [8]:
print("The shape of the dataset is : ", df.shape)

The shape of the dataset is :  (303, 14)


##### 2.3.3 Checking the number of unique values in each column

In [9]:
# count the distinct values in each column (avoids shadowing the built-in `dict`)
unique_counts = {col: df[col].nunique() for col in df.columns}

pd.DataFrame(unique_counts, index=["unique count"]).transpose()

column,unique count
age,41
sex,2
cp,4
trtbps,49
chol,152
fbs,2
restecg,3
thalachh,91
exng,2
oldpeak,40


##### 2.3.4 Separating the columns in categorical and continuous

In [10]:
cat_cols = ['sex','exng','caa','cp','fbs','restecg','slp','thall']
con_cols = ["age","trtbps","chol","thalachh","oldpeak"]
target_col = ["output"]
print("The categorical cols are : ", cat_cols)
print("The continuous cols are : ", con_cols)
print("The target variable is :  ", target_col)

The categorical cols are :  ['sex', 'exng', 'caa', 'cp', 'fbs', 'restecg', 'slp', 'thall']
The continuous cols are :  ['age', 'trtbps', 'chol', 'thalachh', 'oldpeak']
The target variable is :   ['output']
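
The lists above are written by hand. One possible way to derive such a split from the unique counts is a simple cutoff, sketched below on a small made-up frame. The cutoff `MAX_CATEGORIES` and the `toy_*` names are assumptions for illustration only; the notebook does not say how its lists were chosen.

```python
import pandas as pd

# Small made-up frame standing in for the real dataset.
toy = pd.DataFrame({
    "sex": [0, 1, 1, 0, 1, 0],
    "cp": [0, 1, 2, 3, 0, 1],
    "chol": [233, 250, 204, 236, 354, 192],
    "oldpeak": [2.3, 3.5, 1.4, 0.8, 0.6, 0.4],
})

MAX_CATEGORIES = 5  # hypothetical cutoff: at most this many distinct values counts as categorical

n_unique = toy.nunique()
toy_cat_cols = n_unique[n_unique <= MAX_CATEGORIES].index.tolist()
toy_con_cols = n_unique[n_unique > MAX_CATEGORIES].index.tolist()

print("categorical:", toy_cat_cols)
print("continuous: ", toy_con_cols)
```

A cutoff like this is only a starting point; a binary target column would also land in the categorical list and has to be set aside separately.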


#### 4.2 Packages <a id=13></a>

In [11]:
# Scaling
from sklearn.preprocessing import RobustScaler, StandardScaler, OneHotEncoder

# Train Test Split
from sklearn.model_selection import train_test_split

# Models
import torch
import torch.nn as nn
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.compose import ColumnTransformer

# Metrics
from sklearn.metrics import accuracy_score, classification_report, roc_curve

# Cross Validation
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

from imblearn.pipeline import Pipeline as imbPipeline

print('Packages imported...')

Packages imported...



#### 4.3 Making features model ready <a id=14></a>

##### 4.3.1 Scaling and Encoding features

In [12]:
# creating a copy of df (plain assignment would only create a second name for the same object)
df1 = df.copy()

# define the columns to be encoded and scaled
cat_cols = ['sex','exng','caa','cp','fbs','restecg','slp','thall']
con_cols = ["age","trtbps","chol","thalachh","oldpeak"]

# # encoding the categorical columns
# df1 = pd.get_dummies(df1, columns = cat_cols, drop_first = True)

# # defining the features and target
# X = df1.drop(['output'],axis=1)
# y = df1[['output']]

# # instantiating the scaler
# scaler = RobustScaler()

# # scaling the continuous features
# X[con_cols] = scaler.fit_transform(X[con_cols])
# print("The first 5 rows of X are")
# X.columns

In [14]:
# Define preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), con_cols),
        ('cat', OneHotEncoder(), cat_cols)
    ])

# Split data into features and target variable
X = df1.drop('output', axis=1)
y = df1['output']
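
To see what a preprocessor of this kind produces, here is a minimal sketch on a two-column toy frame. It is not the notebook's preprocessor: the data is made up, and `handle_unknown="ignore"` and `sparse_threshold=0` are assumptions added for the demonstration. By default `OneHotEncoder` raises an error when `transform` meets a category it did not see during `fit`, which can happen with rare codes after a train/test split.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the real data: one continuous and one categorical column.
train = pd.DataFrame({"age": [29, 54, 77], "cp": [0, 1, 1]})
test = pd.DataFrame({"age": [60], "cp": [3]})  # cp=3 never appears in train

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age"]),
        # assumption: encode unseen categories as all zeros instead of raising
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["cp"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

train_t = toy_preprocessor.fit_transform(train)
test_t = toy_preprocessor.transform(test)

print(train_t.shape)  # one scaled column plus one column per cp category seen in train
print(test_t[0, 1:])  # one-hot part of the row with the unseen category
```

Whether ignoring unseen categories is appropriate depends on the data; it trades a hard failure for a silent all-zero encoding.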

In [15]:
# Create a pipeline that preprocesses the data and then trains a classifier
# (no resampling step is included)
logreg = imbPipeline(steps=[('preprocessor', preprocessor),
                            ('logreg', LogisticRegression())])

##### 4.3.2 Train and test split

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 42)
print("The shape of X_train is      ", X_train.shape)
print("The shape of X_test is       ",X_test.shape)
print("The shape of y_train is      ",y_train.shape)
print("The shape of y_test is       ",y_test.shape)

The shape of X_train is       (242, 13)
The shape of X_test is        (61, 13)
The shape of y_train is       (242,)
The shape of y_test is        (61,)
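
The 242/61 split follows from the sizes alone. As far as I understand scikit-learn's behaviour, a fractional `test_size` is rounded up to a whole number of rows and the remainder goes to training; the arithmetic can be checked with the standard library:

```python
import math

n_samples = 303   # rows in the dataset
test_size = 0.2   # fraction passed to train_test_split

# Test rows are rounded up; the training set gets whatever is left.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```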


##### 5.1.3 Logistic Regression

In [17]:
# instantiating the object
# logreg = LogisticRegression()

# fitting the object
logreg.fit(X_train, y_train)

# calculating the probabilities
y_pred_proba = logreg.predict_proba(X_test)

# finding the predicted values
y_pred = np.argmax(y_pred_proba,axis=1)

# printing the test accuracy
print("The test accuracy score of Logistic Regression is ", accuracy_score(y_test, y_pred))

The test accuracy score of Logistic Regression is  0.8852459016393442
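
For a two-class problem, taking `np.argmax` over the probability columns gives the same labels as thresholding the class-1 probability at 0.5 (ties aside). The sketch below checks this on five probability rows copied from the notebook's `y_pred_proba` output; the variable names are illustrative.

```python
import numpy as np

# First five rows of the y_pred_proba output printed in this notebook.
proba = np.array([
    [0.98500435, 0.01499565],
    [0.47509571, 0.52490429],
    [0.34598742, 0.65401258],
    [0.98832553, 0.01167447],
    [0.04279064, 0.95720936],
])

# Column index of the larger probability in each row.
labels_argmax = np.argmax(proba, axis=1)

# Equivalent rule for two classes: predict 1 when P(class 1) exceeds 0.5.
labels_threshold = (proba[:, 1] > 0.5).astype(int)

print(labels_argmax)
print(labels_threshold)
```

Writing the rule as an explicit threshold makes it easy to try cutoffs other than 0.5, for example when false negatives are costlier than false positives.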


In [18]:
y_pred

array([0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0,
       0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int64)

In [19]:
y_pred_proba

array([[0.98500435, 0.01499565],
       [0.47509571, 0.52490429],
       [0.34598742, 0.65401258],
       [0.98832553, 0.01167447],
       [0.04279064, 0.95720936],
       [0.05699215, 0.94300785],
       [0.38566788, 0.61433212],
       [0.99755234, 0.00244766],
       [0.99431434, 0.00568566],
       [0.52886763, 0.47113237],
       [0.44570652, 0.55429348],
       [0.87677933, 0.12322067],
       [0.05380347, 0.94619653],
       [0.95482713, 0.04517287],
       [0.00823325, 0.99176675],
       [0.0410098 , 0.9589902 ],
       [0.01447998, 0.98552002],
       [0.97334437, 0.02665563],
       [0.99684524, 0.00315476],
       [0.99232964, 0.00767036],
       [0.48728507, 0.51271493],
       [0.9320271 , 0.0679729 ],
       [0.59781332, 0.40218668],
       [0.25516769, 0.74483231],
       [0.18956541, 0.81043459],
       [0.40085989, 0.59914011],
       [0.07823661, 0.92176339],
       [0.32844125, 0.67155875],
       [0.97953226, 0.02046774],
       [0.03777036, 0.96222964],
       [0.


In [21]:
import os
import pickle

In [22]:
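# save the fitted pipeline; the context manager closes the file after writing
with open(os.path.join('../models', "model_ha.pkl"), "wb") as model_file:
    pickle.dump(logreg, model_file)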
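
A saved model is only useful if it loads back correctly, so a quick round-trip check is worth having. The sketch below uses a plain dictionary as a stand-in for the fitted pipeline and a temporary directory in place of `../models`; both are assumptions that keep the example self-contained.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted `logreg` pipeline.
stand_in_model = {"name": "logreg", "classes": [0, 1]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_ha.pkl")

    # Write the object to disk.
    with open(path, "wb") as fh:
        pickle.dump(stand_in_model, fh)

    # Read it back and compare with the original.
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == stand_in_model)
```

For a real scikit-learn pipeline, the loading environment should have compatible library versions installed, and pickle files should only be loaded from trusted sources.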