# 
# HEART DISEASE PREDICTION MODEL
# ========================================================

Final Project - Coderhouse. Year 2022
Enrique Guillermo Paz.
pazenriqueguillermo@gmail.com



Introduction

Machine Learning is used around the world in many disciplines and industries, and healthcare is no exception. It can be an essential tool for predicting the presence or absence of prevalent diseases.
Heart disease is one of the most prevalent diseases, and it costs the healthcare system about $219 billion each year in the United States alone. Predicting it well in advance can give doctors important insights, allowing them to adapt diagnosis and treatment to each patient and prevent the onset of the disease, which would also save the healthcare system a great deal of money.

Objective

Develop a model capable of predicting the onset of heart disease based on risk factors and previous history of the patient.

Dataset Description

The objective of the Behavioral Risk Factor Surveillance System (BRFSS) is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, such as the one addressed in this prediction model project, which is Coronary Disease (commonly known as Heart Disease/Heart Attack).

Data are collected from a random sample of adults (one per household) through a telephone survey. BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.

We'll be using the 2015 dataset. It originally has 330 features (columns) covering risk factors for chronic health conditions, including not only Heart Disease but also other prevalent diseases such as Diabetes.

A brief description of the variables used is available in the variables description.txt file, and a full description of all the variables is available in the BRFSS PDF document.

In [1]:
# Importing libraries

import numpy as np
import pandas as pd

# 
# 1. Importing the data.
#

In [2]:
# Creating database from .csv file

df_raw = pd.read_csv ('2015.csv')

In [4]:
# Beginning exploration of the dataset - how many rows and columns it has.

df_raw.shape

(441456, 330)

In [5]:
# Checking that the data loaded in is in the correct format
pd.set_option('display.max_columns', 350)
df_raw.head (10)

Unnamed: 0,_STATE,FMONTH,IDATE,IMONTH,IDAY,IYEAR,DISPCODE,SEQNO,_PSU,CTELENUM,PVTRESD1,COLGHOUS,STATERES,CELLFON3,LADULT,NUMADULT,NUMMEN,NUMWOMEN,CTELNUM1,CELLFON2,CADULT,PVTRESD2,CCLGHOUS,CSTATE,LANDLINE,HHADULT,GENHLTH,PHYSHLTH,MENTHLTH,POORHLTH,HLTHPLN1,PERSDOC2,MEDCOST,CHECKUP1,BPHIGH4,BPMEDS,BLOODCHO,CHOLCHK,TOLDHI2,CVDINFR4,CVDCRHD4,CVDSTRK3,ASTHMA3,ASTHNOW,CHCSCNCR,CHCOCNCR,CHCCOPD1,HAVARTH3,ADDEPEV2,CHCKIDNY,DIABETE3,DIABAGE2,SEX,MARITAL,EDUCA,RENTHOM1,NUMHHOL2,NUMPHON2,CPDEMO1,VETERAN3,EMPLOY1,CHILDREN,INCOME2,INTERNET,WEIGHT2,HEIGHT3,PREGNANT,QLACTLM2,USEEQUIP,BLIND,DECIDE,DIFFWALK,DIFFDRES,DIFFALON,SMOKE100,SMOKDAY2,STOPSMK2,LASTSMK2,USENOW3,ALCDAY5,AVEDRNK2,DRNK3GE5,MAXDRNKS,FRUITJU1,FRUIT1,FVBEANS,FVGREEN,FVORANG,VEGETAB1,EXERANY2,EXRACT11,EXEROFT1,EXERHMM1,EXRACT21,EXEROFT2,EXERHMM2,STRENGTH,LMTJOIN3,ARTHDIS2,ARTHSOCL,JOINPAIN,SEATBELT,FLUSHOT6,FLSHTMY2,IMFVPLAC,PNEUVAC3,HIVTST6,HIVTSTD3,WHRTST10,PDIABTST,PREDIAB1,INSULIN,BLDSUGAR,FEETCHK2,DOCTDIAB,CHKHEMO3,FEETCHK,EYEEXAM,DIABEYE,DIABEDU,PAINACT2,QLMENTL2,QLSTRES2,QLHLTH2,CAREGIV1,CRGVREL1,CRGVLNG1,CRGVHRS1,CRGVPRB1,CRGVPERS,CRGVHOUS,CRGVMST2,CRGVEXPT,VIDFCLT2,VIREDIF3,VIPRFVS2,VINOCRE2,VIEYEXM2,VIINSUR2,VICTRCT4,VIGLUMA2,VIMACDG2,CIMEMLOS,CDHOUSE,CDASSIST,CDHELP,CDSOCIAL,CDDISCUS,WTCHSALT,LONGWTCH,DRADVISE,ASTHMAGE,ASATTACK,ASERVIST,ASDRVIST,ASRCHKUP,ASACTLIM,ASYMPTOM,ASNOSLEP,ASTHMED3,ASINHALR,HAREHAB1,STREHAB1,CVDASPRN,ASPUNSAF,RLIVPAIN,RDUCHART,RDUCSTRK,ARTTODAY,ARTHWGT,ARTHEXER,ARTHEDU,TETANUS,HPVADVC2,HPVADSHT,SHINGLE2,HADMAM,HOWLONG,HADPAP2,LASTPAP2,HPVTEST,HPLSTTST,HADHYST2,PROFEXAM,LENGEXAM,BLDSTOOL,LSTBLDS3,HADSIGM3,HADSGCO1,LASTSIG3,PCPSAAD2,PCPSADI1,PCPSARE1,PSATEST1,PSATIME,PCPSARS1,PCPSADE1,PCDMDECN,SCNTMNY1,SCNTMEL1,SCNTPAID,SCNTWRK1,SCNTLPAD,SCNTLWK1,SXORIENT,TRNSGNDR,RCSGENDR,RCSRLTN2,CASTHDX2,CASTHNO2,EMTSUPRT,LSATISFY,ADPLEASR,ADDOWN,ADSLEEP,ADENERGY,ADEAT1,ADFAIL,ADTHINK,ADMOVE,MISTMNT,ADANXEV,QSTVER,QSTLANG,EXACTOT1,EXACTOT2,MSCODE,_STSTR,_STRWT,_RAWRAKE,_WT2RAKE,_CHISPNC,_CRACE1,
_CPRACE,_CLLCPWT,_DUALUSE,_DUALCOR,_LLCPWT,_RFHLTH,_HCVU651,_RFHYPE5,_CHOLCHK,_RFCHOL,_MICHD,_LTASTH1,_CASTHM1,_ASTHMS1,_DRDXAR1,_PRACE1,_MRACE1,_HISPANC,_RACE,_RACEG21,_RACEGR3,_RACE_G1,_AGEG5YR,_AGE65YR,_AGE80,_AGE_G,HTIN4,HTM4,WTKG3,_BMI5,_BMI5CAT,_RFBMI5,_CHLDCNT,_EDUCAG,_INCOMG,_SMOKER3,_RFSMOK3,DRNKANY5,DROCDY3_,_RFBING5,_DRNKWEK,_RFDRHV5,FTJUDA1_,FRUTDA1_,BEANDAY_,GRENDAY_,ORNGDAY_,VEGEDA1_,_MISFRTN,_MISVEGN,_FRTRESP,_VEGRESP,_FRUTSUM,_VEGESUM,_FRTLT1,_VEGLT1,_FRT16,_VEG23,_FRUITEX,_VEGETEX,_TOTINDA,METVL11_,METVL21_,MAXVO2_,FC60_,ACTIN11_,ACTIN21_,PADUR1_,PADUR2_,PAFREQ1_,PAFREQ2_,_MINAC11,_MINAC21,STRFREQ_,PAMISS1_,PAMIN11_,PAMIN21_,PA1MIN_,PAVIG11_,PAVIG21_,PA1VIGM_,_PACAT1,_PAINDX1,_PA150R2,_PA300R2,_PA30021,_PASTRNG,_PAREC1,_PASTAE1,_LMTACT1,_LMTWRK1,_LMTSCL1,_RFSEAT2,_RFSEAT3,_FLSHOT6,_PNEUMO2,_AIDTST3
0,1.0,1.0,b'01292015',b'01',b'29',b'2015',1200.0,2015000000.0,2015000000.0,1.0,1.0,,1.0,2.0,,3.0,1.0,2.0,,,,,,,,,5.0,15.0,18.0,10.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,2.0,3.0,,2.0,1.0,4.0,1.0,2.0,,1.0,2.0,8.0,88.0,3.0,2.0,280.0,510.0,,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,3.0,,2.0,3.0,888.0,,,,305.0,310.0,320.0,310.0,305.0,101.0,2.0,,,,,,,888.0,1.0,1.0,1.0,6.0,1.0,1.0,112014.0,1.0,1.0,1.0,,,1.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,b'',,,,,,,,,,,,,,,,,,,,,,,,,10.0,1.0,b'',b'',3.0,11011.0,28.78156,3.0,86.344681,,,,,1.0,0.614125,341.384853,2.0,1.0,2.0,1.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,9.0,1.0,63.0,5.0,70.0,178.0,12701.0,4018.0,4.0,2.0,1.0,2.0,2.0,3.0,1.0,2.0,5.397605e-79,1.0,5.397605e-79,1.0,17.0,33.0,67.0,33.0,17.0,100.0,5.397605e-79,5.397605e-79,1.0,1.0,50.0,217.0,2.0,1.0,1.0,1.0,5.397605e-79,5.397605e-79,2.0,,,2469.0,423.0,,,,,,,,,5.397605e-79,5.397605e-79,,,,,,,4.0,2.0,3.0,3.0,2.0,2.0,4.0,2.0,1.0,1.0,1.0,1.0,1.0,,,1.0
1,1.0,1.0,b'01202015',b'01',b'20',b'2015',1100.0,2015000000.0,2015000000.0,1.0,1.0,,1.0,2.0,,1.0,5.397605e-79,1.0,,,,,,,,,3.0,88.0,88.0,,2.0,1.0,1.0,4.0,3.0,,1.0,4.0,2.0,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,2.0,2.0,2.0,3.0,,2.0,2.0,6.0,1.0,2.0,,2.0,2.0,3.0,88.0,1.0,1.0,165.0,508.0,,1.0,2.0,1.0,1.0,2.0,2.0,2.0,1.0,1.0,2.0,,3.0,888.0,,,,302.0,305.0,302.0,202.0,202.0,304.0,1.0,64.0,212.0,100.0,69.0,212.0,100.0,888.0,,,,,3.0,2.0,,,2.0,2.0,,,2.0,3.0,,,,,,,,,,,,,,2.0,,,,,,,,1.0,,,,,,,,,,1.0,5.0,5.0,,5.0,2.0,2.0,,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,2.0,,,,,,,,,,b'',1.0,2.0,,,2.0,60.0,,,,,,,,,,,,,,,,,,,10.0,1.0,b'',b'',5.0,11011.0,28.78156,1.0,28.78156,,,,,9.0,,108.060903,1.0,2.0,1.0,2.0,1.0,2.0,1.0,1.0,3.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,7.0,1.0,52.0,4.0,68.0,173.0,7484.0,2509.0,3.0,2.0,1.0,4.0,1.0,1.0,2.0,2.0,5.397605e-79,1.0,5.397605e-79,1.0,7.0,17.0,7.0,29.0,29.0,13.0,5.397605e-79,5.397605e-79,1.0,1.0,24.0,78.0,2.0,2.0,1.0,1.0,5.397605e-79,5.397605e-79,1.0,35.0,5.397605e-79,2876.0,493.0,1.0,5.397605e-79,60.0,60.0,2800.0,2800.0,168.0,5.397605e-79,5.397605e-79,5.397605e-79,168.0,5.397605e-79,168.0,5.397605e-79,5.397605e-79,5.397605e-79,2.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,3.0,3.0,4.0,2.0,2.0,,,2.0
2,1.0,1.0,b'02012015',b'02',b'01',b'2015',1200.0,2015000000.0,2015000000.0,1.0,1.0,,1.0,2.0,,2.0,1.0,1.0,,,,,,,,,4.0,15.0,88.0,88.0,1.0,2.0,2.0,1.0,3.0,,1.0,1.0,1.0,7.0,2.0,1.0,2.0,,2.0,1.0,2.0,1.0,2.0,2.0,3.0,,2.0,2.0,4.0,1.0,2.0,,1.0,2.0,7.0,88.0,99.0,2.0,158.0,511.0,,2.0,2.0,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,b'',,,,,,,,,,,,,,,,,,,,,,,,,10.0,1.0,b'',b'',5.0,11011.0,28.78156,2.0,57.56312,,,,,1.0,0.614125,255.264797,2.0,9.0,1.0,1.0,2.0,,1.0,1.0,3.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,11.0,2.0,71.0,6.0,71.0,180.0,7167.0,2204.0,2.0,1.0,1.0,2.0,9.0,9.0,9.0,9.0,900.0,9.0,99900.0,9.0,,,,,,,2.0,4.0,5.397605e-79,5.397605e-79,,,9.0,9.0,1.0,1.0,1.0,1.0,9.0,,,2173.0,373.0,,,,,,,,,,9.0,,,,,,,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,
3,1.0,1.0,b'01142015',b'01',b'14',b'2015',1100.0,2015000000.0,2015000000.0,1.0,1.0,,1.0,2.0,,3.0,1.0,2.0,,,,,,,,,5.0,30.0,30.0,30.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,,2.0,1.0,2.0,1.0,1.0,2.0,3.0,,2.0,1.0,4.0,1.0,2.0,,1.0,2.0,8.0,1.0,8.0,2.0,180.0,507.0,,1.0,2.0,1.0,1.0,1.0,2.0,1.0,2.0,,,,3.0,888.0,,,,555.0,101.0,555.0,301.0,301.0,201.0,2.0,,,,,,,888.0,1.0,1.0,1.0,8.0,1.0,1.0,777777.0,5.0,1.0,9.0,,,2.0,3.0,,,,,,,,,,,,,,2.0,,,,,,,,2.0,,,,,,,,,,1.0,1.0,1.0,2.0,1.0,1.0,2.0,,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,1.0,2.0,1.0,,,,,,,,b'',4.0,7.0,,,,97.0,,,,,,,,,,,,,,,,,,,10.0,1.0,b'',b'',3.0,11011.0,28.78156,3.0,86.344681,,,,,1.0,0.614125,341.384853,2.0,1.0,2.0,1.0,2.0,2.0,1.0,1.0,3.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,9.0,1.0,63.0,5.0,67.0,170.0,8165.0,2819.0,3.0,2.0,2.0,2.0,5.0,4.0,1.0,2.0,5.397605e-79,1.0,5.397605e-79,1.0,5.397605e-79,100.0,5.397605e-79,3.0,3.0,14.0,5.397605e-79,5.397605e-79,1.0,1.0,100.0,20.0,1.0,2.0,1.0,1.0,5.397605e-79,5.397605e-79,2.0,,,2469.0,423.0,,,,,,,,,5.397605e-79,5.397605e-79,,,,,,,4.0,2.0,3.0,3.0,2.0,2.0,4.0,2.0,1.0,1.0,1.0,1.0,1.0,,,9.0
4,1.0,1.0,b'01142015',b'01',b'14',b'2015',1100.0,2015000000.0,2015000000.0,1.0,1.0,,1.0,2.0,,2.0,1.0,1.0,,,,,,,,,5.0,20.0,88.0,30.0,1.0,1.0,2.0,1.0,3.0,,1.0,1.0,2.0,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,1.0,2.0,2.0,3.0,,2.0,1.0,5.0,1.0,2.0,,2.0,2.0,8.0,88.0,77.0,1.0,142.0,504.0,,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,,,,3.0,888.0,,,,777.0,102.0,203.0,204.0,310.0,320.0,2.0,,,,,,,888.0,1.0,1.0,1.0,7.0,1.0,2.0,,,1.0,1.0,777777.0,1.0,1.0,3.0,,,,,,,,,,,,,,2.0,,,,,,,,7.0,,,,,,,,,,2.0,,,,,,1.0,777.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,1.0,2.0,5.0,,,,,,,,b'',5.0,5.0,,,,45.0,,,,,,,,,,,,,,,,,,,10.0,1.0,b'',b'',3.0,11011.0,28.78156,2.0,57.56312,,,,,9.0,,258.682223,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,3.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,9.0,1.0,61.0,5.0,64.0,163.0,6441.0,2437.0,2.0,1.0,1.0,3.0,9.0,4.0,1.0,2.0,5.397605e-79,1.0,5.397605e-79,1.0,,200.0,43.0,57.0,33.0,67.0,1.0,5.397605e-79,5.397605e-79,1.0,,200.0,9.0,1.0,1.0,1.0,1.0,5.397605e-79,2.0,,,2543.0,436.0,,,,,,,,,5.397605e-79,5.397605e-79,,,,,,,4.0,2.0,3.0,3.0,2.0,2.0,4.0,2.0,1.0,1.0,1.0,1.0,1.0,,,1.0
5,1.0,1.0,b'01142015',b'01',b'14',b'2015',1100.0,2015000000.0,2015000000.0,1.0,1.0,,1.0,2.0,,1.0,5.397605e-79,1.0,,,,,,,,,2.0,88.0,88.0,,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,1.0,2.0,2.0,3.0,,2.0,3.0,3.0,1.0,2.0,,2.0,2.0,2.0,88.0,6.0,2.0,145.0,502.0,,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,,,,3.0,888.0,,,,101.0,101.0,102.0,101.0,102.0,101.0,1.0,18.0,101.0,100.0,73.0,107.0,30.0,888.0,1.0,2.0,3.0,4.0,1.0,1.0,112014.0,1.0,1.0,2.0,,,1.0,3.0,,,,,,,,,,,,,,2.0,,,,,,,,2.0,,,,,,,,,,2.0,,,,,,1.0,415.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,1.0,2.0,1.0,,,,,,,,b'',5.0,5.0,1.0,54.0,,,,,,,,,,,,,,,,,,,,,10.0,1.0,b'',b'',5.0,11011.0,28.78156,1.0,28.78156,,,,,9.0,,256.518591,1.0,9.0,2.0,1.0,1.0,2.0,1.0,1.0,3.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,11.0,2.0,73.0,6.0,62.0,157.0,6577.0,2652.0,3.0,2.0,1.0,1.0,4.0,4.0,1.0,2.0,5.397605e-79,1.0,5.397605e-79,1.0,100.0,100.0,200.0,100.0,200.0,100.0,5.397605e-79,5.397605e-79,1.0,1.0,200.0,600.0,1.0,1.0,1.0,1.0,5.397605e-79,5.397605e-79,1.0,50.0,33.0,2099.0,360.0,2.0,1.0,60.0,30.0,1000.0,7000.0,60.0,210.0,5.397605e-79,5.397605e-79,120.0,210.0,330.0,60.0,5.397605e-79,60.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,1.0,2.0,3.0,1.0,1.0,1.0,1.0,2.0
6,1.0,1.0,b'01052015',b'01',b'05',b'2015',1100.0,2015000000.0,2015000000.0,1.0,1.0,,1.0,2.0,,1.0,5.397605e-79,1.0,,,,,,,,,2.0,88.0,3.0,88.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,2.0,2.0,2.0,3.0,,2.0,3.0,5.0,1.0,2.0,,2.0,2.0,7.0,88.0,4.0,2.0,148.0,506.0,,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,,,,3.0,203.0,1.0,88.0,1.0,330.0,202.0,302.0,202.0,302.0,205.0,1.0,18.0,102.0,100.0,64.0,107.0,15.0,888.0,,,,,1.0,2.0,,,1.0,1.0,91991.0,4.0,2.0,3.0,,,,,,,,,,,,,,8.0,,,,,,,,,,,,,,,,,,2.0,,,,,,1.0,403.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,5.0,1.0,1.0,6.0,,,,,,,,b'',4.0,5.0,,,2.0,40.0,,,,,,,,,,,,,,,,,,,10.0,1.0,b'',b'',1.0,11011.0,28.78156,1.0,28.78156,,,,,9.0,,85.659755,1.0,9.0,2.0,1.0,2.0,2.0,1.0,1.0,3.0,2.0,1.0,7.0,2.0,7.0,2.0,4.0,5.0,11.0,2.0,70.0,6.0,66.0,168.0,6713.0,2389.0,2.0,1.0,1.0,3.0,2.0,4.0,1.0,1.0,10.0,1.0,70.0,1.0,100.0,29.0,7.0,29.0,7.0,71.0,5.397605e-79,5.397605e-79,1.0,1.0,129.0,114.0,1.0,1.0,1.0,1.0,5.397605e-79,5.397605e-79,1.0,50.0,35.0,2210.0,379.0,2.0,1.0,60.0,15.0,2000.0,7000.0,120.0,105.0,5.397605e-79,5.397605e-79,240.0,105.0,345.0,120.0,5.397605e-79,120.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,3.0,3.0,4.0,1.0,1.0,2.0,1.0,1.0
7,1.0,1.0,b'01142015',b'01',b'14',b'2015',1100.0,2015000000.0,2015000000.0,1.0,1.0,,1.0,2.0,,2.0,1.0,1.0,,,,,,,,,5.0,8.0,88.0,8.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,7.0,2.0,2.0,2.0,,2.0,2.0,2.0,1.0,2.0,7.0,3.0,,1.0,1.0,3.0,1.0,2.0,,1.0,1.0,3.0,88.0,3.0,2.0,179.0,501.0,,1.0,2.0,2.0,2.0,1.0,2.0,2.0,1.0,3.0,,7.0,3.0,888.0,,,,102.0,101.0,202.0,101.0,303.0,202.0,1.0,64.0,106.0,12.0,98.0,107.0,5.0,888.0,1.0,1.0,1.0,77.0,1.0,1.0,122014.0,1.0,1.0,1.0,777777.0,4.0,1.0,3.0,,,,,,,,,,,,,,2.0,,,,,,,,2.0,,,,,,,,,,2.0,,,,,,1.0,777.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,1.0,2.0,4.0,,,,,,,,b'',4.0,2.0,,,1.0,45.0,,,,,,,,,,,,,,,,,,,10.0,1.0,b'',b'Treadmill',5.0,11011.0,28.78156,2.0,57.56312,,,,,1.0,0.614125,545.782095,2.0,9.0,2.0,1.0,2.0,,1.0,1.0,3.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,13.0,2.0,80.0,6.0,61.0,155.0,8119.0,3382.0,4.0,2.0,1.0,1.0,2.0,3.0,1.0,2.0,5.397605e-79,1.0,5.397605e-79,1.0,200.0,100.0,29.0,100.0,10.0,29.0,5.397605e-79,5.397605e-79,1.0,1.0,300.0,168.0,1.0,1.0,1.0,1.0,5.397605e-79,5.397605e-79,1.0,35.0,45.0,1545.0,265.0,2.0,2.0,12.0,5.0,6000.0,7000.0,72.0,5.397605e-79,5.397605e-79,5.397605e-79,144.0,5.397605e-79,144.0,72.0,5.397605e-79,72.0,3.0,2.0,2.0,2.0,2.0,2.0,4.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
8,1.0,1.0,b'01132015',b'01',b'13',b'2015',1100.0,2015000000.0,2015000000.0,1.0,1.0,,1.0,2.0,,1.0,5.397605e-79,1.0,,,,,,,,,5.0,77.0,88.0,77.0,1.0,1.0,2.0,1.0,3.0,,7.0,,,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,1.0,2.0,2.0,3.0,,2.0,3.0,3.0,1.0,2.0,,1.0,2.0,5.0,88.0,77.0,2.0,84.0,503.0,,1.0,1.0,2.0,2.0,7.0,2.0,2.0,2.0,,,,3.0,888.0,,,,777.0,777.0,302.0,302.0,777.0,777.0,1.0,98.0,103.0,100.0,88.0,,,777.0,2.0,1.0,2.0,6.0,1.0,1.0,777777.0,1.0,1.0,2.0,,,1.0,3.0,,,,,,,,,,,,,,2.0,,,,,,,,2.0,,,,,,,,,,2.0,,,,,,7.0,,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,2.0,,,,,,,,,,b'',5.0,5.0,,,,98.0,,,,,,,,,,,,,,,,,,,10.0,1.0,b'Physical Therapy',b'',1.0,11011.0,28.78156,1.0,28.78156,,,,,1.0,0.614125,211.210295,2.0,9.0,1.0,9.0,,2.0,1.0,1.0,3.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,13.0,2.0,80.0,6.0,63.0,160.0,3810.0,1488.0,1.0,1.0,1.0,1.0,9.0,4.0,1.0,2.0,5.397605e-79,1.0,5.397605e-79,1.0,,,7.0,7.0,,,2.0,2.0,5.397605e-79,5.397605e-79,,,9.0,9.0,1.0,1.0,1.0,1.0,1.0,45.0,5.397605e-79,1618.0,277.0,2.0,5.397605e-79,60.0,,3000.0,,180.0,5.397605e-79,,5.397605e-79,360.0,5.397605e-79,360.0,180.0,5.397605e-79,180.0,1.0,1.0,1.0,1.0,1.0,9.0,9.0,9.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0
9,1.0,1.0,b'01302015',b'01',b'30',b'2015',1100.0,2015000000.0,2015000000.0,1.0,1.0,,1.0,2.0,,2.0,1.0,1.0,,,,,,,,,2.0,2.0,88.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,2.0,2.0,2.0,3.0,,1.0,1.0,6.0,1.0,2.0,,1.0,2.0,7.0,88.0,8.0,1.0,161.0,507.0,,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,3.0,,7.0,3.0,888.0,,,,201.0,101.0,204.0,205.0,206.0,101.0,1.0,64.0,106.0,50.0,88.0,,,888.0,,,,,2.0,2.0,,,2.0,2.0,,,1.0,3.0,,,,,,,,,,,,,,2.0,,,,,,,,2.0,,,,,,,,,,2.0,,,,,,1.0,415.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,1.0,2.0,2.0,,,,,,,,b'',5.0,5.0,,,1.0,60.0,,,,,,,,,,,,,,,,,,,10.0,1.0,b'',b'',1.0,11011.0,28.78156,2.0,57.56312,,,,,1.0,0.614125,215.472863,1.0,9.0,2.0,1.0,2.0,2.0,1.0,1.0,3.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,10.0,2.0,68.0,6.0,67.0,170.0,7303.0,2522.0,3.0,2.0,1.0,4.0,5.0,3.0,1.0,2.0,5.397605e-79,1.0,5.397605e-79,1.0,14.0,100.0,57.0,71.0,86.0,100.0,5.397605e-79,5.397605e-79,1.0,1.0,114.0,314.0,1.0,1.0,1.0,1.0,5.397605e-79,5.397605e-79,1.0,35.0,5.397605e-79,2260.0,387.0,1.0,5.397605e-79,50.0,,6000.0,,300.0,5.397605e-79,5.397605e-79,5.397605e-79,300.0,5.397605e-79,300.0,5.397605e-79,5.397605e-79,5.397605e-79,2.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,3.0,3.0,4.0,1.0,2.0,2.0,2.0,2.0


In [6]:
#At this point we have 441,456 records and 330 columns. Each record contains an individual's BRFSS survey responses.

In [7]:
df_raw.to_csv ('heart_disease_raw.csv')

In [8]:
#Selecting the variables related to Heart Disease risk factors (see description above).

df = df_raw[['_MICHD', '_RFHYPE5', 'TOLDHI2','WEIGHT2', '_BMI5', 'SMOKE100', 'CVDSTRK3', 'DIABETE3', '_TOTINDA', 
                    '_FRTLT1', '_VEGLT1', '_RFDRHV5', 'GENHLTH', 'MENTHLTH', 'PHYSHLTH', 'DIFFWALK', 
                    'SEX', '_AGEG5YR']]
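
When only a handful of columns are needed, pandas can also skip the rest at load time. The sketch below shows the idea on a tiny in-memory CSV standing in for 2015.csv; the sample rows and the shortened `wanted` list are illustrative assumptions, not part of the original notebook.

```python
import io
import pandas as pd

# Tiny stand-in for '2015.csv' (illustrative values only).
sample_csv = io.StringIO(
    "_STATE,_MICHD,_BMI5,SEX\n"
    "1,2,4018,2\n"
    "1,2,2509,2\n"
)

# Hypothetical, shortened version of the column list selected above.
wanted = ['_MICHD', '_BMI5', 'SEX']

# usecols tells read_csv to parse only the listed columns.
df_small = pd.read_csv(sample_csv, usecols=wanted)
print(df_small.columns.tolist())
```

With the real file, the same `usecols` argument would avoid holding all 330 columns in memory just to discard most of them.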

In [9]:
df.shape

(441456, 18)

In [10]:
df.head (10)

Unnamed: 0,_MICHD,_RFHYPE5,TOLDHI2,WEIGHT2,_BMI5,SMOKE100,CVDSTRK3,DIABETE3,_TOTINDA,_FRTLT1,_VEGLT1,_RFDRHV5,GENHLTH,MENTHLTH,PHYSHLTH,DIFFWALK,SEX,_AGEG5YR
0,2.0,2.0,1.0,280.0,4018.0,1.0,2.0,3.0,2.0,2.0,1.0,1.0,5.0,18.0,15.0,1.0,2.0,9.0
1,2.0,1.0,2.0,165.0,2509.0,1.0,2.0,3.0,1.0,2.0,2.0,1.0,3.0,88.0,88.0,2.0,2.0,7.0
2,,1.0,1.0,158.0,2204.0,,1.0,3.0,9.0,9.0,9.0,9.0,4.0,88.0,15.0,,2.0,11.0
3,2.0,2.0,1.0,180.0,2819.0,2.0,2.0,3.0,2.0,1.0,2.0,1.0,5.0,30.0,30.0,1.0,2.0,9.0
4,2.0,1.0,2.0,142.0,2437.0,2.0,2.0,3.0,2.0,9.0,1.0,1.0,5.0,88.0,20.0,2.0,2.0,9.0
5,2.0,2.0,2.0,145.0,2652.0,2.0,2.0,3.0,1.0,1.0,1.0,1.0,2.0,88.0,88.0,2.0,2.0,11.0
6,2.0,2.0,1.0,148.0,2389.0,2.0,2.0,3.0,1.0,1.0,1.0,1.0,2.0,3.0,88.0,2.0,2.0,11.0
7,,2.0,1.0,179.0,3382.0,1.0,2.0,3.0,1.0,1.0,1.0,1.0,5.0,88.0,8.0,1.0,1.0,13.0
8,2.0,1.0,,84.0,1488.0,2.0,2.0,3.0,1.0,9.0,9.0,1.0,5.0,88.0,77.0,7.0,2.0,13.0
9,2.0,2.0,1.0,161.0,2522.0,1.0,2.0,3.0,1.0,1.0,1.0,1.0,2.0,88.0,2.0,2.0,1.0,10.0


In [11]:
# At this point we have 441,456 records and 18 columns.

In [12]:
df.to_csv ('heart_disease_dropcol.csv')

# 
# 2. Cleaning the data.
#

In [13]:
#Checking for missing values.

df.isnull ().sum ()

_MICHD       3942
_RFHYPE5        0
TOLDHI2     59154
WEIGHT2      5315
_BMI5       36398
SMOKE100    14255
CVDSTRK3        0
DIABETE3        7
_TOTINDA        0
_FRTLT1         0
_VEGLT1         0
_RFDRHV5        0
GENHLTH         2
MENTHLTH        0
PHYSHLTH        1
DIFFWALK    12334
SEX             0
_AGEG5YR        0
dtype: int64

In [14]:
# Expressing the missing values as a percentage of the dataset gives a better idea of how much of our data they represent.

df.isnull().sum() / len(df) * 100

_MICHD       0.892954
_RFHYPE5     0.000000
TOLDHI2     13.399750
WEIGHT2      1.203970
_BMI5        8.244989
SMOKE100     3.229087
CVDSTRK3     0.000000
DIABETE3     0.001586
_TOTINDA     0.000000
_FRTLT1      0.000000
_VEGLT1      0.000000
_RFDRHV5     0.000000
GENHLTH      0.000453
MENTHLTH     0.000000
PHYSHLTH     0.000227
DIFFWALK     2.793936
SEX          0.000000
_AGEG5YR     0.000000
dtype: float64
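
The percentage of missing values per column can also be obtained without referring to the row count at all, because the mean of a boolean mask is the fraction of True values. A minimal sketch on a toy frame (the `toy` data is made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame: column 'a' has one missing value out of four.
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [1.0, 2.0, 3.0, 4.0],
})

# isnull() gives True/False; mean() turns that into a fraction.
pct_missing = toy.isnull().mean() * 100
print(pct_missing)
```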

In [15]:
# Since we are working with a large dataset and most columns have few missing values
# (the largest share is about 13.4%, for TOLDHI2), we can simply drop the rows with missing values instead of imputing them.


df = df.dropna()
df.shape

(343607, 18)
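
Because a row is dropped when any of its 18 columns is missing, the total loss is larger than any single column's percentage. The arithmetic below uses only the two row counts reported so far:

```python
rows_before = 441456   # rows before dropna()
rows_after = 343607    # rows after dropna()

dropped = rows_before - rows_after
pct_dropped = dropped / rows_before * 100
print(dropped, round(pct_dropped, 1))
```

This works out to 97,849 rows, a little over 22% of the original records.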

In [16]:
# We now have 343607 rows.

In [17]:
# Modifying the variables coding to make it more suitable for Machine Learning Algorithms.

#1.  '_MICHD' (Target Variable)
# Change 2 to 0 because this means did not have Heart Disease

df['_MICHD'] = df['_MICHD'].replace({2: 0})
df['_MICHD'].unique ()

array([0., 1.])

In [18]:
#2.  _RFHYPE5
#Change 1 to 0 so it represents no high blood pressure, and 2 to 1 so it represents high blood pressure.

df['_RFHYPE5'] = df['_RFHYPE5'].replace ({1:0, 2:1})
df['_RFHYPE5'].unique ()

array([1., 0., 9.])

In [19]:
#Removing the rows still coded 9 (don't know / refused / missing).

df = df[df._RFHYPE5 != 9]
df['_RFHYPE5'].unique ()

array([1., 0.])

In [20]:
#3. TOLDHI2
# Change 2 to 0 because it is No
# Remove all 7 (don't know)
# Remove all 9 (refused)

df['TOLDHI2'] = df['TOLDHI2'].replace({2: 0})
df = df[df.TOLDHI2 != 7]
df = df[df.TOLDHI2 != 9]
df['TOLDHI2'].unique ()

array([1., 0.])

In [21]:
#4. _BMI5
# Values are stored as BMI * 100 (for example, 3520 is really 35.20), so divide by 100 and round to the nearest whole number.


df['_BMI5'] = df['_BMI5'].div (100).round ()
df['_BMI5'].unique()

array([40., 25., 28., 24., 27., 30., 26., 23., 34., 31., 33., 21., 22.,
       38., 20., 19., 32., 46., 41., 37., 36., 29., 35., 18., 54., 45.,
       39., 47., 43., 55., 49., 42., 17., 16., 48., 44., 50., 59., 15.,
       52., 53., 57., 51., 14., 58., 63., 61., 56., 60., 74., 62., 64.,
       13., 66., 73., 65., 68., 85., 71., 84., 67., 70., 82., 79., 92.,
       72., 88., 96., 81., 12., 77., 95., 75., 91., 69., 76., 87., 89.,
       83., 98., 86., 80., 90., 78., 97.])
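
The effect of the scaling step can be checked on the first few raw values shown earlier (4018, 2509, 2819). The sketch also shows a one-decimal variant, offered only as an alternative in case more precision is wanted; it is not what the notebook uses.

```python
import pandas as pd

# Raw _BMI5 values taken from the first rows displayed above.
raw_bmi = pd.Series([4018.0, 2509.0, 2819.0], name='_BMI5')

# What the notebook does: scale and round to whole numbers.
bmi_whole = raw_bmi.div(100).round()

# Alternative (assumption, not in the notebook): keep one decimal.
bmi_one_decimal = raw_bmi.div(100).round(1)

print(bmi_whole.tolist())
print(bmi_one_decimal.tolist())
```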

In [22]:
#5. WEIGHT2
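#Removing outlier values: keep only weights below 400. This also drops the large special codes (such as 7777 and 9999), which are not real weights.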

df = df[df.WEIGHT2 < 400]
df['WEIGHT2'].value_counts()

200.0    18269
180.0    16803
150.0    16357
160.0    15392
170.0    14006
         ...  
73.0         2
396.0        1
381.0        1
66.0         1
74.0         1
Name: WEIGHT2, Length: 332, dtype: int64

In [23]:
#6. SMOKE100
# Change 2 to 0 because it is No
# Remove all 7 (don't know)
# Remove all 9 (refused)


df['SMOKE100'] = df['SMOKE100'].replace({2:0})
df = df[df.SMOKE100 != 7]
df = df[df.SMOKE100 != 9]
df['SMOKE100'].unique()

array([1., 0.])

In [24]:
#7. CVDSTRK3
# Change 2 to 0 because it is No
# Remove all 7 (don't know)
# Remove all 9 (refused)

df['CVDSTRK3'] = df['CVDSTRK3'].replace({2:0})
df = df[df.CVDSTRK3 != 7]
df = df[df.CVDSTRK3 != 9]
df['CVDSTRK3'].unique()

array([0., 1.])
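
The yes/no recoding used for TOLDHI2, SMOKE100 and CVDSTRK3 follows one pattern, so it could be wrapped in a small helper. This is only a sketch: `recode_yes_no` is a hypothetical name, and the toy data is made up.

```python
import pandas as pd

def recode_yes_no(frame, column):
    """Drop rows coded 7 or 9, then map 2 -> 0 (No); 1 (Yes) is kept."""
    # .copy() gives an independent frame, so the assignment below
    # does not trigger pandas' SettingWithCopyWarning.
    out = frame[~frame[column].isin([7, 9])].copy()
    out[column] = out[column].replace({2: 0})
    return out

toy = pd.DataFrame({'SMOKE100': [1.0, 2.0, 7.0, 9.0, 2.0]})
clean = recode_yes_no(toy, 'SMOKE100')
print(clean['SMOKE100'].tolist())
```

Filtering first and recoding afterwards gives the same result as the cells above, since the codes 7 and 9 are not touched by the 2 -> 0 mapping.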

In [25]:
#8. DIABETE3
# going to make this ordinal. 0 is for no diabetes or only during pregnancy, 1 is for pre-diabetes or borderline diabetes, 2 is for yes diabetes
# Remove all 7 (don't know)
# Remove all 9 (refused)


df['DIABETE3'] = df['DIABETE3'].replace({2:0, 3:0, 1:2, 4:1})
df = df[df.DIABETE3 != 7]
df = df[df.DIABETE3 != 9]
df['DIABETE3'].unique()

array([0., 2., 1.])
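
The DIABETE3 mapping sends 1 to 2 and 2 to 0 in the same call. This is safe because a dict passed to `replace` is applied in a single pass over the original values rather than as a chain of substitutions, as this small check illustrates:

```python
import pandas as pd

# One example of each original DIABETE3 code that gets recoded.
codes = pd.Series([1.0, 2.0, 3.0, 4.0])

# Same mapping as in the cell above.
recoded = codes.replace({2: 0, 3: 0, 1: 2, 4: 1})
print(recoded.tolist())
```

The original code 1 (yes) ends up as 2 and is not subsequently turned into 0.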

In [26]:
#9. _TOTINDA
# 1 for physical activity
# change 2 to 0 for no physical activity
# Remove all 9 (don't know/refused)


df['_TOTINDA'] = df['_TOTINDA'].replace({2:0})
df = df[df._TOTINDA != 9]
df['_TOTINDA'].unique()

array([0., 1.])

In [27]:
#10. _FRTLT1
# Change 2 to 0, meaning no fruit consumed per day; 1 means one or more pieces of fruit per day.
# Remove all 9 (don't know or missing).

df['_FRTLT1'] = df['_FRTLT1'].replace({2:0})
df = df[df._FRTLT1 != 9]
df['_FRTLT1'].unique()

array([0., 1.])

In [28]:
#11. _VEGLT1
# Change 2 to 0, meaning no vegetables consumed per day; 1 means one or more pieces of vegetable per day.
# Remove all 9 (don't know or missing).


df['_VEGLT1'] = df['_VEGLT1'].replace({2:0})
df = df[df._VEGLT1 != 9]
df['_VEGLT1'].unique()

array([1., 0.])

In [29]:
#12. _RFDRHV5
# Change 1 to 0 (1 was no for heavy drinking) and 2 to 1 (2 was yes for heavy drinking).
# Remove all 9 (don't know or missing).

df['_RFDRHV5'] = df['_RFDRHV5'].replace({1:0, 2:1})
df = df[df._RFDRHV5 != 9]
df['_RFDRHV5'].unique()

array([0., 1.])

In [30]:
#13. GENHLTH
# This is an ordinal variable (1 is Excellent -> 5 is Poor)
# Remove 7 and 9 for don't know and refused


df = df[df.GENHLTH != 7]
df = df[df.GENHLTH != 9]
df['GENHLTH'].unique()

array([5., 3., 2., 4., 1.])

In [31]:
#14. MENTHLTH
# already in days so keep that, scale will be 0-30
# change 88 to 0 because it means none (no bad mental health days)
# remove 77 and 99 for don't know not sure and refused


df['MENTHLTH'] = df['MENTHLTH'].replace({88:0})
df = df[df.MENTHLTH != 77]
df = df[df.MENTHLTH != 99]
df['MENTHLTH'].unique()

array([18.,  0., 30.,  3.,  5., 15., 10.,  6., 20.,  2., 25.,  1., 29.,
        4.,  7.,  8., 21., 14., 26.,  9., 16., 28., 11., 12., 24., 17.,
       13., 23., 27., 19., 22.])

In [32]:
#15. PHYSHLTH
# already in days so keep that, scale will be 0-30
# change 88 to 0 because it means none (no bad physical health days)
# remove 77 and 99 for don't know not sure and refused


df['PHYSHLTH'] = df['PHYSHLTH'].replace({88:0})
df = df[df.PHYSHLTH != 77]
df = df[df.PHYSHLTH != 99]
df['PHYSHLTH'].unique()

array([15.,  0., 30.,  2., 14., 28.,  7., 20.,  3., 10.,  1.,  5., 17.,
        4., 19.,  6., 21., 12.,  8., 25., 27., 22., 29., 24.,  9., 16.,
       18., 23., 13., 26., 11.])
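
After recoding the day-count columns, it is cheap to confirm that every remaining value really lies on the 0-30 scale. A minimal sketch on made-up values (the `days` series is illustrative):

```python
import pandas as pd

# Toy day counts including the special codes 77, 88 and 99.
days = pd.Series([18.0, 88.0, 30.0, 77.0, 3.0, 99.0], name='MENTHLTH')

# Drop "don't know" (77) and "refused" (99), then map "none" (88) to 0.
days = days[~days.isin([77, 99])].replace({88: 0})

# Every remaining value should now be between 0 and 30 inclusive.
all_in_range = days.between(0, 30).all()
print(sorted(days.tolist()), all_in_range)
```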

In [33]:
#16. DIFFWALK
# change 2 to 0 for no. 1 is already yes
# remove 7 and 9 for don't know not sure and refused


df['DIFFWALK'] = df['DIFFWALK'].replace({2:0})
df = df[df.DIFFWALK != 7]
df = df[df.DIFFWALK != 9]
df['DIFFWALK'].unique()

array([1., 0.])

In [34]:
#17. SEX
# change 2 to 0 (female as 0). Male is 1

df['SEX'] = df['SEX'].replace({2:0})
df['SEX'].unique()

array([0., 1.])

In [35]:
#18. _AGEG5YR
# Ordinal. 1 is 18-24, all the way up to 13, which is 80 and older, in 5-year increments.
# remove 14 because it is don't know or missing


df = df[df._AGEG5YR != 14]
df['_AGEG5YR'].unique()

array([ 9.,  7., 11., 10., 13.,  8.,  4.,  6.,  2., 12.,  5.,  1.,  3.])

In [36]:
#Check the shape of the dataset now: We have 290527 cleaned rows and 18 columns
df.shape

(290527, 18)
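
With cleaning finished, one extra safeguard is to confirm that every column meant to be binary contains only 0 and 1. The sketch below shows the check on a toy frame with a deliberately bad value; the frame and the `binary_cols` list are illustrative assumptions.

```python
import pandas as pd

# Toy frame: 'SEX' still contains a stray code 9.
toy = pd.DataFrame({
    'SMOKE100': [1.0, 0.0, 1.0],
    'SEX': [0.0, 1.0, 9.0],
})
binary_cols = ['SMOKE100', 'SEX']

# Collect every column that holds something other than 0 or 1.
bad = [c for c in binary_cols if not toy[c].isin([0, 1]).all()]
print(bad)
```

An empty list would mean all the listed columns are clean.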

In [37]:
# Renaming the columns to make them more readable and easier to work with.
df = df.rename(columns = {'_MICHD':'Heart_Disease', 
                                         '_RFHYPE5':'High_BP',  
                                         'TOLDHI2':'High_Chol', 
                                         '_BMI5':'BMI','WEIGHT2': 'Weight', 
                                         'SMOKE100':'Smoker', 
                                         'CVDSTRK3':'Stroke', 'DIABETE3':'Diabetes', 
                                         '_TOTINDA':'Phys_Activ',  
                                         '_FRTLT1':'Eat_Fruits', '_VEGLT1':"Eat_Veg", 
                                         '_RFDRHV5':'Alcohol', 
                                         'GENHLTH':'Gen_Health', 'MENTHLTH':'Ment_Health', 'PHYSHLTH':'Phys_Health', 'DIFFWALK':'Diff_Walk', 
                                         'SEX':'Sex', '_AGEG5YR':'Age'})

In [38]:
df.head ()

Unnamed: 0,Heart_Disease,High_BP,High_Chol,Weight,BMI,Smoker,Stroke,Diabetes,Phys_Activ,Eat_Fruits,Eat_Veg,Alcohol,Gen_Health,Ment_Health,Phys_Health,Diff_Walk,Sex,Age
0,0.0,1.0,1.0,280.0,40.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,5.0,18.0,15.0,1.0,0.0,9.0
1,0.0,0.0,0.0,165.0,25.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,7.0
3,0.0,1.0,1.0,180.0,28.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,5.0,30.0,30.0,1.0,0.0,9.0
5,0.0,1.0,0.0,145.0,27.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,11.0
6,0.0,1.0,1.0,148.0,24.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,2.0,3.0,0.0,0.0,0.0,11.0


# 
# 3. Optimizing column types to save memory.
#

In [39]:
#Checking memory usage before optimization

df.memory_usage ()

Index            2324216
Heart_Disease    2324216
High_BP          2324216
High_Chol        2324216
Weight           2324216
BMI              2324216
Smoker           2324216
Stroke           2324216
Diabetes         2324216
Phys_Activ       2324216
Eat_Fruits       2324216
Eat_Veg          2324216
Alcohol          2324216
Gen_Health       2324216
Ment_Health      2324216
Phys_Health      2324216
Diff_Walk        2324216
Sex              2324216
Age              2324216
dtype: int64

In [40]:
df.dtypes

Heart_Disease    float64
High_BP          float64
High_Chol        float64
Weight           float64
BMI              float64
Smoker           float64
Stroke           float64
Diabetes         float64
Phys_Activ       float64
Eat_Fruits       float64
Eat_Veg          float64
Alcohol          float64
Gen_Health       float64
Ment_Health      float64
Phys_Health      float64
Diff_Walk        float64
Sex              float64
Age              float64
dtype: object

In [41]:
# All of our data is of type float64, which is unnecessary. Let's change the variable types.

In [42]:
#Changing the type of boolean variables.

df = df.astype({'Heart_Disease': 'int8','High_BP': 'int8','High_Chol': 'int8','Smoker': 'int8','Stroke': 'int8','Phys_Activ': 'int8','Eat_Fruits': 'int8','Eat_Veg': 'int8','Alcohol': 'int8','Diff_Walk': 'int8','Sex': 'int8'})

In [43]:
df.dtypes

Heart_Disease       int8
High_BP             int8
High_Chol           int8
Weight           float64
BMI              float64
Smoker              int8
Stroke              int8
Diabetes         float64
Phys_Activ          int8
Eat_Fruits          int8
Eat_Veg             int8
Alcohol             int8
Gen_Health       float64
Ment_Health      float64
Phys_Health      float64
Diff_Walk           int8
Sex                 int8
Age              float64
dtype: object

In [44]:
#Changing the type of the numeric variables. Since all of their values are whole numbers
# (none of them have decimals), we are going to use an integer type. Weight stays float64 here; its values (up to 396) would not fit in int8.

df = df.astype({'BMI': 'int8','Diabetes': 'int8','Gen_Health': 'int8','Ment_Health': 'int8','Phys_Health': 'int8','Age': 'int8'})

In [45]:
df.dtypes

Heart_Disease       int8
High_BP             int8
High_Chol           int8
Weight           float64
BMI                 int8
Smoker              int8
Stroke              int8
Diabetes            int8
Phys_Activ          int8
Eat_Fruits          int8
Eat_Veg             int8
Alcohol             int8
Gen_Health          int8
Ment_Health         int8
Phys_Health         int8
Diff_Walk           int8
Sex                 int8
Age                 int8
dtype: object
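
Weight is the one column still stored as float64. Its values go up to 396, which is outside the int8 range (-128 to 127) but fits comfortably in int16. The following is a sketch of a guarded cast, not something the notebook does; the sample values come from the first rows shown earlier.

```python
import numpy as np
import pandas as pd

weight = pd.Series([280.0, 165.0, 180.0, 145.0, 148.0], name='Weight')

# Only cast when every value is a whole number inside the int16 range.
info = np.iinfo(np.int16)
is_whole = (weight % 1 == 0).all()
in_range = weight.between(info.min, info.max).all()
if is_whole and in_range:
    weight = weight.astype('int16')

print(weight.dtype)
```

Checking before casting matters because `astype` on out-of-range or fractional values would silently change the data.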

In [46]:
#Checking memory allocation after the optimization

df.memory_usage ()

Index            2324216
Heart_Disease     290527
High_BP           290527
High_Chol         290527
Weight           2324216
BMI               290527
Smoker            290527
Stroke            290527
Diabetes          290527
Phys_Activ        290527
Eat_Fruits        290527
Eat_Veg           290527
Alcohol           290527
Gen_Health        290527
Ment_Health       290527
Phys_Health       290527
Diff_Walk         290527
Sex               290527
Age               290527
dtype: int64
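
The figures above follow directly from the row count: a float64 or int64 entry takes 8 bytes per row and an int8 entry takes 1 byte. The totals before and after can be reproduced with plain arithmetic:

```python
rows = 290527
n_cols = 18

# Before: the index plus 18 float64 columns, 8 bytes per value.
before = (n_cols + 1) * rows * 8

# After: the index and Weight at 8 bytes, the other 17 columns at 1 byte.
after = 2 * rows * 8 + (n_cols - 1) * rows

print(before, after, round(before / after, 1))
```

That is roughly 44.2 MB before and 9.6 MB after, a reduction of about 4.6 times.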

In [47]:
# Now our data frame is ready for exploration. We are going to save it.

df.to_csv ('heart_disease_clean.csv')
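
One caveat: a CSV file does not store column types, so the int8 optimization is lost when the file is read back unless the types are passed again. A minimal round-trip sketch using an in-memory buffer instead of a real file (the toy frame is illustrative):

```python
import io
import pandas as pd

toy = pd.DataFrame({'Heart_Disease': [0, 1], 'Age': [9, 7]}).astype('int8')

# index=False omits the row index, which would otherwise be written
# as an extra column with an empty header.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Without dtype=..., read_csv would infer int64 for these columns.
restored = pd.read_csv(buffer, dtype={'Heart_Disease': 'int8', 'Age': 'int8'})
print(restored.dtypes)
```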

In [48]:
df.head (5)

Unnamed: 0,Heart_Disease,High_BP,High_Chol,Weight,BMI,Smoker,Stroke,Diabetes,Phys_Activ,Eat_Fruits,Eat_Veg,Alcohol,Gen_Health,Ment_Health,Phys_Health,Diff_Walk,Sex,Age
0,0,1,1,280.0,40,1,0,0,0,0,1,0,5,18,15,1,0,9
1,0,0,0,165.0,25,1,0,0,1,0,0,0,3,0,0,0,0,7
3,0,1,1,180.0,28,0,0,0,0,1,0,0,5,30,30,1,0,9
5,0,1,0,145.0,27,0,0,0,1,1,1,0,2,0,0,0,0,11
6,0,1,1,148.0,24,0,0,0,1,1,1,0,2,3,0,0,0,11


In [49]:
df.shape

(290527, 18)

In [51]:
df['Heart_Disease'].value_counts (normalize = True) * 100

0    90.41707
1     9.58293
Name: Heart_Disease, dtype: float64
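
These shares show a strongly imbalanced target. A model that always predicted class 0 would already be right about 90% of the time, so accuracy alone is a weak yardstick for this problem. The numbers below are derived only from the percentages printed above:

```python
import pandas as pd

# Class shares reported above, in percent.
shares = pd.Series({0: 90.41707, 1: 9.58293})

# Accuracy of always predicting the majority class.
baseline_accuracy = shares.max()

# How many negative records there are per positive record.
imbalance_ratio = shares[0] / shares[1]

print(round(baseline_accuracy, 1), round(imbalance_ratio, 1))
```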