# Analyzing Behavioral_Risk_Factor_Surveillance_System__BRFSS__-__National_Cardiovascular_Disease_Surveillance_Data_20240112.csv

Location: /work/shibberu/share/MA384_Data_Mining_Projects_Winter_2023-24/CDSA

Website:https://data.cdc.gov/Heart-Disease-Stroke-Prevention/Rates-and-Trends-in-Hypertension-related-Cardiovas/uc9k-vc2j/about_data

Data Description:

This dataset documents rates and trends in local hypertension-related cardiovascular disease (CVD) death rates. Specifically, this report presents county (or county equivalent) estimates of hypertension-related CVD death rates in 2000-2019 and trends during two intervals (2000-2010, 2010-2019) by age group (ages 35â€“64 years, ages 65 years and older), race/ethnicity (non-Hispanic American Indian/Alaska Native, non-Hispanic Asian/Pacific Islander, non-Hispanic Black, Hispanic, non-Hispanic White), and sex (female, male). The rates and trends were estimated using a Bayesian spatiotemporal model and a smoothed over space, time, and demographic group. Rates are age-standardized in 10-year age groups using the 2010 US population. Data source: National Vital Statistics System.

BRFFS - 
Behavioral Risk Factor Surveillance System (BRFSS) is a health-related telephone survey that is collected annually by the CDC

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 35)

In [2]:
path = '/work/shibberu/share/MA384_Data_Mining_Projects_Winter_2023-24/CDSA/Behavioral_Risk_Factor_Surveillance_System__BRFSS__-__National_Cardiovascular_Disease_Surveillance_Data_20240112.csv'

df = pd.read_csv(path)
# df.set_index('RowId', inplace=True)
df.YearStart = pd.to_datetime(df.YearStart).dt.year

df.head()

  df = pd.read_csv(path)


Unnamed: 0,RowId,YearStart,LocationAbbr,LocationDesc,DataSource,PriorityArea1,PriorityArea2,PriorityArea3,PriorityArea4,Class,Topic,Question,Data_Value_Type,Data_Value_Unit,Data_Value,Data_Value_Alt,Data_Value_Footnote_Symbol,Data_Value_Footnote,Low_Confidence_Limit,High_Confidence_Limit,Break_Out_Category,Break_Out,ClassId,TopicId,QuestionId,Data_Value_TypeID,BreakOutCategoryId,BreakOutId,LocationId,Geolocation
0,BRFSS~2011~01~BR001~OVR01~Age-Standardized,1970,AL,Alabama,BRFSS,,,,,Cardiovascular Diseases,Major Cardiovascular Disease,Prevalence of major cardiovascular disease amo...,Age-Standardized,Percent (%),9.9,9.9,,,9.2,10.7,Overall,Overall,C1,T1,BR001,AgeStdz,BOC01,OVR01,1,POINT (-86.63186076199969 32.84057112200048)
1,BRFSS~2011~01~BR001~OVR01~Crude,1970,AL,Alabama,BRFSS,,,,,Cardiovascular Diseases,Major Cardiovascular Disease,Prevalence of major cardiovascular disease amo...,Crude,Percent (%),11.0,11.0,,,10.2,11.9,Overall,Overall,C1,T1,BR001,Crude,BOC01,OVR01,1,POINT (-86.63186076199969 32.84057112200048)
2,BRFSS~2011~01~BR001~GEN01~Crude,1970,AL,Alabama,BRFSS,,,,,Cardiovascular Diseases,Major Cardiovascular Disease,Prevalence of major cardiovascular disease amo...,Crude,Percent (%),12.5,12.5,,,11.1,14.0,Gender,Male,C1,T1,BR001,Crude,BOC02,GEN01,1,POINT (-86.63186076199969 32.84057112200048)
3,BRFSS~2011~01~BR001~GEN01~Age-Standardized,1970,AL,Alabama,BRFSS,,,,,Cardiovascular Diseases,Major Cardiovascular Disease,Prevalence of major cardiovascular disease amo...,Age-Standardized,Percent (%),11.8,11.8,,,10.6,13.2,Gender,Male,C1,T1,BR001,AgeStdz,BOC02,GEN01,1,POINT (-86.63186076199969 32.84057112200048)
4,BRFSS~2011~01~BR001~GEN02~Age-Standardized,1970,AL,Alabama,BRFSS,,,,,Cardiovascular Diseases,Major Cardiovascular Disease,Prevalence of major cardiovascular disease amo...,Age-Standardized,Percent (%),8.3,8.3,,,7.5,9.1,Gender,Female,C1,T1,BR001,AgeStdz,BOC02,GEN02,1,POINT (-86.63186076199969 32.84057112200048)


In [3]:
category_columns = ['LocationAbbr', 'LocationDesc', 'DataSource', 'PriorityArea1',  'PriorityArea3', 'Class', 'Topic', 'Question', 'Data_Value_Type', 'Data_Value_Footnote', 'Break_Out_Category', 'Break_Out', 'ClassId', 'TopicId', 'QuestionId', 'Data_Value_TypeID', 'BreakOutCategoryId', 'BreakOutId', 'LocationId']

for cat in category_columns:
    df[cat] = df[cat].astype('category')

df.Geolocation = df.Geolocation.str.strip("POINT (*)")
df[['Latitude', 'Longitude']] = df['Geolocation'].str.split(" ", expand=True)
df['Latitude'] = df.Latitude.astype(str)
df['Longitude'] = df.Longitude.astype(str)


df.dtypes

RowId                           object
YearStart                        int32
LocationAbbr                  category
LocationDesc                  category
DataSource                    category
PriorityArea1                 category
PriorityArea2                  float64
PriorityArea3                 category
PriorityArea4                  float64
Class                         category
Topic                         category
Question                      category
Data_Value_Type               category
Data_Value_Unit                 object
Data_Value                     float64
Data_Value_Alt                 float64
Data_Value_Footnote_Symbol      object
Data_Value_Footnote           category
Low_Confidence_Limit           float64
High_Confidence_Limit          float64
Break_Out_Category            category
Break_Out                     category
ClassId                       category
TopicId                       category
QuestionId                    category
Data_Value_TypeID        

In [4]:
df.shape

(160160, 32)

In [5]:
nan_count_per_column = df.isna().sum().sort_values(ascending=False)
print("PERCENT NaN:", '\n')
print(100* (nan_count_per_column / 160160))

PERCENT NaN: 

PriorityArea4                 100.000000
PriorityArea2                 100.000000
PriorityArea1                  85.714286
High_Confidence_Limit          52.352023
Low_Confidence_Limit           52.352023
Data_Value                     51.150724
PriorityArea3                  50.000000
Data_Value_Footnote            48.849276
Data_Value_Footnote_Symbol     48.849276
Geolocation                     1.923077
BreakOutId                      0.000000
Break_Out_Category              0.000000
Data_Value_TypeID               0.000000
QuestionId                      0.000000
TopicId                         0.000000
LocationId                      0.000000
ClassId                         0.000000
Break_Out                       0.000000
Latitude                        0.000000
BreakOutCategoryId              0.000000
RowId                           0.000000
YearStart                       0.000000
Data_Value_Alt                  0.000000
Data_Value_Unit                 0.000000
D

In [6]:
duplicate_count = df.duplicated().sum()
total_rows = len(df)
percentage_duplicates = (duplicate_count / total_rows) * 100
print("Percent Duplicates: ", percentage_duplicates.round(2), "  Duplicate Count", duplicate_count, "    Total Rows", total_rows )

Percent Duplicates:  0.0   Duplicate Count 0     Total Rows 160160


In [7]:
duplicate_df = df[df.duplicated(keep=False)].sort_values(by = ['YearStart', 'LocationAbbr', 'LocationDesc', 'DataSource',
       'PriorityArea1', 'PriorityArea2', 'PriorityArea3', 'PriorityArea4',
       'Class', 'Topic', 'Question', 'Data_Value_Type', 'Data_Value_Unit',
       'Data_Value', 'Data_Value_Alt', 'Data_Value_Footnote_Symbol',
       'Data_Value_Footnote', 'Low_Confidence_Limit', 'High_Confidence_Limit',
       'Break_Out_Category', 'Break_Out', 'ClassId', 'TopicId', 'QuestionId',
       'Data_Value_TypeID', 'BreakOutCategoryId', 'BreakOutId', 'LocationId',
       'Geolocation', 'Latitude', 'Longitude'])
duplicate_df.head(20)

Unnamed: 0,RowId,YearStart,LocationAbbr,LocationDesc,DataSource,PriorityArea1,PriorityArea2,PriorityArea3,PriorityArea4,Class,Topic,Question,Data_Value_Type,Data_Value_Unit,Data_Value,Data_Value_Alt,Data_Value_Footnote_Symbol,Data_Value_Footnote,Low_Confidence_Limit,High_Confidence_Limit,Break_Out_Category,Break_Out,ClassId,TopicId,QuestionId,Data_Value_TypeID,BreakOutCategoryId,BreakOutId,LocationId,Geolocation,Latitude,Longitude


In [8]:
for col in df.columns:
    print(df[col].value_counts().sort_values())
    print()

RowId
BRFSS~2011~01~BR001~OVR01~Age-Standardized    1
BRFSS~2011~01~BR001~RAC04~Crude               1
BRFSS~2011~01~BR001~RAC04~Age-Standardized    1
BRFSS~2011~01~BR001~RAC03~Age-Standardized    1
BRFSS~2011~01~BR001~GEN01~Crude               1
                                             ..
BRFSS~2020~58~BR012~RAC02~Crude               1
BRFSS~2020~58~BR012~RAC03~Crude               1
BRFSS~2020~58~BR012~RAC03~Age-Standardized    1
BRFSS~2020~58~BR012~AGE07~Crude               1
BRFSS~2020~58~BR012~RAC07~Age-Standardized    1
Name: count, Length: 160160, dtype: int64

YearStart
1970    160160
Name: count, dtype: int64

LocationAbbr
AK     3080
IA     3080
AR     3080
AZ     3080
CA     3080
CO     3080
CT     3080
DC     3080
DE     3080
FL     3080
GA     3080
MS     3080
HI     3080
MO     3080
IL     3080
IN     3080
KS     3080
KY     3080
LA     3080
MA     3080
MD     3080
ME     3080
MI     3080
ID     3080
MT     3080
NC     3080
WV     3080
AL     3080
ND     3080
NE     308

In [9]:
df.tail(4)

Unnamed: 0,RowId,YearStart,LocationAbbr,LocationDesc,DataSource,PriorityArea1,PriorityArea2,PriorityArea3,PriorityArea4,Class,Topic,Question,Data_Value_Type,Data_Value_Unit,Data_Value,Data_Value_Alt,Data_Value_Footnote_Symbol,Data_Value_Footnote,Low_Confidence_Limit,High_Confidence_Limit,Break_Out_Category,Break_Out,ClassId,TopicId,QuestionId,Data_Value_TypeID,BreakOutCategoryId,BreakOutId,LocationId,Geolocation,Latitude,Longitude
160156,BRFSS~2020~58~BR012~RAC04~Age-Standardized,1970,USM,Median of all states,BRFSS,,,Healthy People 2030,,Risk Factors,Hypertension,Prevalence of hypertension medication use amon...,Age-Standardized,Percent (%),,-1.0,-,Data not available,,,Race,Hispanic,C2,T9,BR012,AgeStdz,BOC04,RAC04,58,,,
160157,BRFSS~2020~58~BR012~RAC04~Crude,1970,USM,Median of all states,BRFSS,,,Healthy People 2030,,Risk Factors,Hypertension,Prevalence of hypertension medication use amon...,Crude,Percent (%),,-1.0,-,Data not available,,,Race,Hispanic,C2,T9,BR012,Crude,BOC04,RAC04,58,,,
160158,BRFSS~2020~58~BR012~RAC07~Crude,1970,USM,Median of all states,BRFSS,,,Healthy People 2030,,Risk Factors,Hypertension,Prevalence of hypertension medication use amon...,Crude,Percent (%),,-1.0,-,Data not available,,,Race,Other,C2,T9,BR012,Crude,BOC04,RAC07,58,,,
160159,BRFSS~2020~58~BR012~RAC07~Age-Standardized,1970,USM,Median of all states,BRFSS,,,Healthy People 2030,,Risk Factors,Hypertension,Prevalence of hypertension medication use amon...,Age-Standardized,Percent (%),,-1.0,-,Data not available,,,Race,Other,C2,T9,BR012,AgeStdz,BOC04,RAC07,58,,,


In [12]:
topic_grp = df.groupby(by=['Topic'], observed =False)
topic_grp.apply(lambda x: x.head())

Unnamed: 0_level_0,Unnamed: 1_level_0,RowId,YearStart,LocationAbbr,LocationDesc,DataSource,PriorityArea1,PriorityArea2,PriorityArea3,PriorityArea4,Class,Topic,Question,Data_Value_Type,Data_Value_Unit,Data_Value,Data_Value_Alt,Data_Value_Footnote_Symbol,Data_Value_Footnote,Low_Confidence_Limit,High_Confidence_Limit,Break_Out_Category,Break_Out,ClassId,TopicId,QuestionId,Data_Value_TypeID,BreakOutCategoryId,BreakOutId,LocationId,Geolocation,Latitude,Longitude
Topic,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1
Acute Myocardial Infarction (Heart Attack),11440,BRFSS~2011~01~BR003~OVR01~Crude,1970,AL,Alabama,BRFSS,Million Hearts,,,,Cardiovascular Diseases,Acute Myocardial Infarction (Heart Attack),Prevalence of acute myocardial infarction (hea...,Crude,Percent (%),5.0,5.0,,,4.5,5.7,Overall,Overall,C1,T3,BR003,Crude,BOC01,OVR01,1,-86.63186076199969 32.84057112200048,-86.63186076199969,32.84057112200048
Acute Myocardial Infarction (Heart Attack),11441,BRFSS~2011~01~BR003~OVR01~Age-Standardized,1970,AL,Alabama,BRFSS,Million Hearts,,,,Cardiovascular Diseases,Acute Myocardial Infarction (Heart Attack),Prevalence of acute myocardial infarction (hea...,Age-Standardized,Percent (%),4.5,4.5,,,4.0,5.1,Overall,Overall,C1,T3,BR003,AgeStdz,BOC01,OVR01,1,-86.63186076199969 32.84057112200048,-86.63186076199969,32.84057112200048
Acute Myocardial Infarction (Heart Attack),11442,BRFSS~2011~01~BR003~GEN01~Age-Standardized,1970,AL,Alabama,BRFSS,Million Hearts,,,,Cardiovascular Diseases,Acute Myocardial Infarction (Heart Attack),Prevalence of acute myocardial infarction (hea...,Age-Standardized,Percent (%),6.4,6.4,,,5.4,7.5,Gender,Male,C1,T3,BR003,AgeStdz,BOC02,GEN01,1,-86.63186076199969 32.84057112200048,-86.63186076199969,32.84057112200048
Acute Myocardial Infarction (Heart Attack),11443,BRFSS~2011~01~BR003~GEN01~Crude,1970,AL,Alabama,BRFSS,Million Hearts,,,,Cardiovascular Diseases,Acute Myocardial Infarction (Heart Attack),Prevalence of acute myocardial infarction (hea...,Crude,Percent (%),6.7,6.7,,,5.7,7.9,Gender,Male,C1,T3,BR003,Crude,BOC02,GEN01,1,-86.63186076199969 32.84057112200048,-86.63186076199969,32.84057112200048
Acute Myocardial Infarction (Heart Attack),11444,BRFSS~2011~01~BR003~GEN02~Age-Standardized,1970,AL,Alabama,BRFSS,Million Hearts,,,,Cardiovascular Diseases,Acute Myocardial Infarction (Heart Attack),Prevalence of acute myocardial infarction (hea...,Age-Standardized,Percent (%),3.0,3.0,,,2.6,3.4,Gender,Female,C1,T3,BR003,AgeStdz,BOC02,GEN02,1,-86.63186076199969 32.84057112200048,-86.63186076199969,32.84057112200048
Cholesterol Abnormalities,57200,BRFSS~2011~01~BR013~OVR01~Age-Standardized,1970,AL,Alabama,BRFSS,,,Healthy People 2030,,Risk Factors,Cholesterol Abnormalities,Prevalence of cholesterol screening in the pas...,Age-Standardized,Percent (%),95.8,95.8,,,94.9,96.5,Overall,Overall,C2,T10,BR013,AgeStdz,BOC01,OVR01,1,-86.63186076199969 32.84057112200048,-86.63186076199969,32.84057112200048
Cholesterol Abnormalities,57201,BRFSS~2011~01~BR013~OVR01~Crude,1970,AL,Alabama,BRFSS,,,Healthy People 2030,,Risk Factors,Cholesterol Abnormalities,Prevalence of cholesterol screening in the pas...,Crude,Percent (%),96.0,96.0,,,95.3,96.7,Overall,Overall,C2,T10,BR013,Crude,BOC01,OVR01,1,-86.63186076199969 32.84057112200048,-86.63186076199969,32.84057112200048
Cholesterol Abnormalities,57202,BRFSS~2011~01~BR013~GEN01~Crude,1970,AL,Alabama,BRFSS,,,Healthy People 2030,,Risk Factors,Cholesterol Abnormalities,Prevalence of cholesterol screening in the pas...,Crude,Percent (%),96.3,96.3,,,95.1,97.2,Gender,Male,C2,T10,BR013,Crude,BOC02,GEN01,1,-86.63186076199969 32.84057112200048,-86.63186076199969,32.84057112200048
Cholesterol Abnormalities,57203,BRFSS~2011~01~BR013~GEN01~Age-Standardized,1970,AL,Alabama,BRFSS,,,Healthy People 2030,,Risk Factors,Cholesterol Abnormalities,Prevalence of cholesterol screening in the pas...,Age-Standardized,Percent (%),96.2,96.2,,,94.9,97.2,Gender,Male,C2,T10,BR013,AgeStdz,BOC02,GEN01,1,-86.63186076199969 32.84057112200048,-86.63186076199969,32.84057112200048
Cholesterol Abnormalities,57204,BRFSS~2011~01~BR013~GEN02~Age-Standardized,1970,AL,Alabama,BRFSS,,,Healthy People 2030,,Risk Factors,Cholesterol Abnormalities,Prevalence of cholesterol screening in the pas...,Age-Standardized,Percent (%),95.3,95.3,,,94.1,96.3,Gender,Female,C2,T10,BR013,AgeStdz,BOC02,GEN02,1,-86.63186076199969 32.84057112200048,-86.63186076199969,32.84057112200048


In [11]:
# ## Data Break Down

# The rows are unique and data is wrapped into the row id:
#   Row ID Format: DataSource-Year-LocationID-QuestionId-BreakOutId-Data_Value_Type
#                  BRFSS~2011~01~BR001~OVR01~Age-Standardized	


# YearStart 

# ### Location Info:
#     LocationAbbr                
#     LocationDesc  
#     LocationId                  
#     Geolocation                   
#     Latitude                      
#     Longitude                                   

# ### Data Values:   
#     Data_Value_Unit               
#     Data_Value                    
#     Data_Value_Alt                
#     Data_Value_Footnote_Symbol    
#     Data_Value_Footnote         
#     Low_Confidence_Limit          
#     High_Confidence_Limit     
#     DataSource -- constant for all rows --
#     Data_Value_Type
#         Age-Standardized     58240
#         Crude               101920
#         Name: count, dtype: int64

# ### Priority Area:
# PriorityArea1              
#     Million Hearts    22880
# PriorityArea3 
#     Healthy People 2030    80080                 
# PriorityArea2  - all NaN          
# PriorityArea4   - all NaN          

# ## Break Downs

# Class
#     Cardiovascular Diseases     57200
#     Risk Factors               102960                         
# Topic   
#     Coronary Heart Disease                        11440
#     Diabetes                                      11440
#     Major Cardiovascular Disease                  11440
#     Nutrition                                     11440
#     Obesity                                       11440
#     Physical Inactivity                           11440
#     Smoking                                       11440
#     Stroke                                        11440
#     Acute Myocardial Infarction (Heart Attack)    22880
#     Cholesterol Abnormalities                     22880
#     Hypertension                                  22880               

# Question
#     Prevalence of acute myocardial infarction (heart attack) among US adults (18+); BRFSS                    11440                 
#     Prevalence of cholesterol screening in the past 5 years among US adults (20+); BRFSS                     11440                 
#     Prevalence of consuming fruits and vegetables less than 5 times per day among US adults (18+); BRFSS     11440                 
#     Prevalence of coronary heart disease among US adults (18+); BRFSS                                        11440                 
#     Prevalence of current smoking among US adults (18+); BRFSS                                               11440                 
#     Prevalence of diabetes among US adults (18+); BRFSS                                                      11440                 
#     Prevalence of high total cholesterol among US adults (20+); BRFSS                                        11440                 
#     Prevalence of hypertension among US adults (18+); BRFSS                                                  11440                 
#     Prevalence of hypertension medication use among US adults (18+) with hypertension; BRFSS                 11440                 
#     Prevalence of major cardiovascular disease among US adults (18+); BRFSS                                  11440                 
#     Prevalence of obesity among US adults (20+); BRFSS                                                       11440                 
#     Prevalence of physical inactivity among US adults (18+); BRFSS                                           11440                 
#     Prevalence of post-hospitalization rehabilitation among heart attack patients, US adults (18+); BRFSS    11440                 
#     Prevalence of stroke among US adults (18+); BRFSS                                                        11440                 

# Break_Out_Category
#     Overall    14560
#     Gender     29120
#     Age        43680
#     Race       72800

# Break_Out
#     20-24                  1560
#     18-24                  5720
#     25-44                  7280
#     35+                    7280
#     45-64                  7280
#     65+                    7280
#     75+                    7280
#     Female                14560
#     Hispanic              14560
#     Male                  14560
#     Non-Hispanic Asian    14560
#     Non-Hispanic Black    14560
#     Non-Hispanic White    14560
#     Other                 14560
#     Overall               14560                


# ### ID Columns:
# ClassId                     
# TopicId                     
# QuestionId                  
# Data_Value_TypeID           
# BreakOutCategoryId          
# BreakOutId      
