<center><h2> Behavioral Risk Factor Data: Tobacco Use (2011 to present)</h2></center>  
<center><h3> Subtitle  </h3></center><br></br>
<center><h4>Authors:</h4></center>

<center><h4>Yashar Mansouri & Nasrin Khansari</h4></center>

In [24]:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# !pip install sodapy # run this line if sodapy is not installed

from sodapy import Socrata


## Data Description & Resources

**Data Last Updated**: December 11, 2018

**Metadata Last Updated**: February 14, 2019

**Date Created**: June 3, 2014

**Data Provided by**: Centers for Disease Control and Prevention, National Center for Chronic Disease Prevention and Health Promotion, Office on Smoking and Health

**Publisher**: Centers for Disease Control and Prevention

**Contact Name**: OSHData Support

**Contact Email**: OSHData@cdc.gov

**Bureau Code**: 009:20

**Program Code**: 009:020

**Public Access Level**: Public Domain

**Data Dictionary**: https://chronicdata.cdc.gov/Survey-Data/Behavioral-Risk-Factor-Data-Tobacco-Use-2011-to-pr/wsas-xwh5

**References**: https://chronicdata.cdc.gov/d/5amh-5sx3

**Glossary/Methodology**: https://chronicdata.cdc.gov/d/5amh-5sx3

**Category**: Survey Data

**Tags**: osh, office on smoking and health, state system, tobacco, survey, behavioral, risk, surveillance, tobacco use, cigarette, cigarette use, adult, smoking, smoking status, smoker, current, former, never, ever, frequency, every day, some days, demographics, age, gender, race, ethnicity, education, cessation, quit, prevalence, brfss

**License**: Public Domain

**Source Link**: http://www.cdc.gov/brfss/



## A quick brief on the data:

### Source
Behavioral Risk Factor Surveillance System Survey Data

### Methods
The BRFSS is a continuous, state-based surveillance system that collects information about modifiable risk factors for chronic diseases and other leading causes of death. The data for the STATE System were extracted from the annual BRFSS surveys from participating states. For estimates among racial and ethnic subgroups, two-year combined data are available. Sample sizes of less than 50 were considered to be inadequate for data analysis. The STATE System does not display percentages for these sample sizes; instead, "NA" will appear in the percentage box for that demographic group.
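
The suppression rule described above can be mirrored when working with raw estimates. The sketch below uses a small made-up table (the rows, the `toy` frame, and the `MIN_SAMPLE_SIZE` name are illustrative and not taken from the dataset) and masks any estimate whose sample size is below 50.

```python
import pandas as pd

# Hypothetical rows for illustration only; not real BRFSS values.
toy = pd.DataFrame({
    "race": ["All Races", "Group A", "Group B"],
    "sample_size": [2598, 78, 41],
    "data_value": [23.4, 40.2, 35.0],
})

MIN_SAMPLE_SIZE = 50  # threshold stated in the Methods text

# Keep estimates from adequate samples; set the rest to missing,
# mirroring the "NA" that the STATE System displays.
toy["data_value_masked"] = toy["data_value"].where(
    toy["sample_size"] >= MIN_SAMPLE_SIZE
)
```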

### Sampling

For 2011 data and forward, a random-digit dialing system was used to select samples of adults in households with landline or cellular telephones. The sample represented adults from each state who were civilian, aged 18 years or older, and not institutionalized. Most states now use computer-assisted telephone interviewing (CATI) software, which lets the interviewer enter data directly into a computer, reducing errors and eliminating unacceptable responses. More detailed information on the sampling methodology is available on the BRFSS website (http://www.cdc.gov/brfss/).

### Questionnaire

The questionnaire was composed of three sets of questions: 

1. A core set of questions asked in all participating states. 
2. A standard module containing questions on selected topics developed by CDC and asked at the discretion of each state. 
3. Questions developed for a particular state to meet a particular need.

The core questions allow data to be compared between states. Because many of the same questions are asked each year, emerging health trends can be identified and monitored.

**Notes**: "NA" indicates that survey data are not available. 

**Citation**: Centers for Disease Control and Prevention (CDC). Behavioral Risk Factor Surveillance System Survey Data. Atlanta, Georgia: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, 2019. 


In [25]:
#importing data

client = Socrata("chronicdata.cdc.gov", None)

#total rows of data is 33451, setting limit to 35000 to capture all data

results = client.get("wsas-xwh5", limit=35000)

# Convert to pandas DataFrame

df = pd.DataFrame.from_records(results)



Notes about `cols_to_drop`:

1. **'topictype'** is "Tobacco Use - Survey Data" on all entries.
2. **'topictypeid'** is "BEH" on all entries.
3. **'topicid', 'measureid', 'stratificationid1', 'stratificationid2', 'stratificationid3', 'stratificationid4', 'submeasureid', 'displayorder'** are not needed since they are used by CDC's application.
4. **':@computed_region_bxsw_vy29', ':@computed_region_he4y_prf8'** are created while importing via the API and are unnecessary.
5. **'data_value_footnote', 'data_value_footnote_symbol'** are unnecessary since the rows with null data_values, to which these two columns relate, are dropped.
6. **'datasource'** is "BRFSS" on all entries.
7. **'data_value_type'** is "Percentage" on all entries.
8. **'data_value_unit'** is "%" on all entries.
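
Several of the notes above say that a column holds the same value on every row. A quick way to confirm that kind of claim before dropping columns is to count distinct values per column. The sketch below does this on a hypothetical mini-frame (`toy`, with three made-up rows), not on the downloaded data.

```python
import pandas as pd

# Hypothetical mini-frame; the real DataFrame comes from the API call.
toy = pd.DataFrame({
    "topictype": ["Tobacco Use - Survey Data"] * 3,
    "datasource": ["BRFSS"] * 3,
    "data_value_unit": ["%"] * 3,
    "locationabbr": ["AL", "AK", "AZ"],
})

# Columns with a single distinct value carry no information for analysis.
n_unique = toy.nunique(dropna=False)
constant_cols = n_unique[n_unique == 1].index.tolist()
```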

In [26]:
# keeping only the required columns, reordered for analysis; .copy() avoids SettingWithCopyWarning on later in-place edits
df = df[['year', 'locationabbr', 'locationdesc', 'topicdesc', 'measuredesc', 'age', 'race', 'education', 'response', 'sample_size', 'data_value', 'data_value_std_err', 'low_confidence_limit', 'high_confidence_limit']].copy()

In [31]:
# dropping rows with a null data_value or sample_size
df.dropna(subset=['data_value', 'sample_size'], inplace=True)

In [32]:
#checking for duplicates
df.duplicated().sum() > 0

False

In [33]:
df.isna().sum()
#we'll keep the null response values

year                         0
locationabbr                 0
locationdesc                 0
topicdesc                    0
measuredesc                  0
age                          0
race                         0
education                    0
response                 20852
sample_size                  0
data_value                   0
data_value_std_err           0
low_confidence_limit         0
high_confidence_limit        0
dtype: int64

In [34]:
df.education.unique()

array(['All Grades', '< 12th Grade', '> 12th Grade', '12th Grade'],
      dtype=object)

In [35]:
df.year.unique()

array(['2017', '2016-2017', '2016', '2015-2016', '2015', '2014-2015',
       '2014', '2013-2014', '2013', '2012-2013', '2012', '2011-2012',
       '2011'], dtype=object)

In [None]:
df.query(' year == "2011-2012"')

In [50]:
# dropping the two-year combined estimates (year ranges such as "2011-2012")
idx_drop = df.query('(year == "2011-2012") | (year == "2013-2014") | (year == "2014-2015") | (year == "2015-2016") | (year == "2016-2017")').index

df.drop(idx_drop, inplace = True)
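
The query above names five of the six two-year ranges returned by `df.year.unique()`; "2012-2013" does not appear in it. A pattern-based filter avoids listing the labels by hand. The sketch below shows one way to do that on a toy frame, assuming every combined-year label contains a hyphen and no single-year label does.

```python
import pandas as pd

# Toy frame with a mix of single years and two-year ranges.
toy = pd.DataFrame({"year": ["2017", "2016-2017", "2013", "2012-2013", "2011"]})

# True for labels such as "2012-2013"; regex=False treats "-" as a literal character.
is_range = toy["year"].str.contains("-", regex=False)

# Keep only the single-year rows.
single_years = toy[~is_range].reset_index(drop=True)
```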

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29098 entries, 0 to 33450
Data columns (total 14 columns):
year                     29098 non-null object
locationabbr             29098 non-null object
locationdesc             29098 non-null object
topicdesc                29098 non-null object
measuredesc              29098 non-null object
age                      29098 non-null object
race                     29098 non-null object
education                29098 non-null object
response                 10995 non-null object
sample_size              29098 non-null object
data_value               29098 non-null object
data_value_std_err       29098 non-null object
low_confidence_limit     29098 non-null object
high_confidence_limit    29098 non-null object
dtypes: object(14)
memory usage: 3.3+ MB
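
The summary above reports every column as `object`, so the numeric fields are still strings. A minimal sketch of converting them follows, using a toy frame with the same column names. The `numeric_cols` list and the use of `errors="coerce"` (which turns unparseable entries into NaN instead of raising) are choices made here, not taken from the notebook.

```python
import pandas as pd

# Two rows shaped like the API output: every value is a string.
toy = pd.DataFrame({
    "sample_size": ["2413", "272"],
    "data_value": ["6.5", "37.7"],
    "data_value_std_err": ["0.6", "3.8"],
    "low_confidence_limit": ["5.2", "30.3"],
    "high_confidence_limit": ["7.8", "45.1"],
})

numeric_cols = ["sample_size", "data_value", "data_value_std_err",
                "low_confidence_limit", "high_confidence_limit"]

# Convert each listed column from string to a numeric dtype.
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric, errors="coerce")
```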


In [54]:
df

Unnamed: 0,year,locationabbr,locationdesc,topicdesc,measuredesc,age,race,education,response,sample_size,data_value,data_value_std_err,low_confidence_limit,high_confidence_limit,count
0,2017,AL,Alabama,Smokeless Tobacco Use (Adults),Current Use,45 to 64 Years,All Races,All Grades,,2413,6.5,0.6,5.2,7.8,24136.5
1,2017,AL,Alabama,Cessation (Adults),Percent of Former Smokers Among Ever Smokers,All Ages,All Races,All Grades,,2834,52.8,1.3,50.3,55.3,283452.8
2,2017,AL,Alabama,Smokeless Tobacco Use (Adults),Frequency of Use,All Ages,All Races,All Grades,Some Days,272,37.7,3.8,30.3,45.1,27237.7
3,2017,AL,Alabama,Cigarette Use (Adults),Current Smoking,All Ages,American Indian/Alaska Native,All Grades,,78,40.2,7.8,25,55.4,7840.2
4,2017,AL,Alabama,Cigarette Use (Adults),Current Smoking,18 to 44 Years,All Races,All Grades,,962,21.9,1.7,18.5,25.3,96221.9
6,2017,AL,Alabama,Cigarette Use (Adults),Smoking Frequency,All Ages,All Races,All Grades,Some Days,532,32.4,2.7,27.1,37.7,53232.4
7,2017,AL,Alabama,Smokeless Tobacco Use (Adults),User Status,All Ages,All Races,All Grades,Not Current,6491,93.7,0.4,92.8,94.6,649193.7
8,2017,AL,Alabama,Smokeless Tobacco Use (Adults),Current Use,All Ages,All Races,All Grades,,2600,11.5,0.8,9.8,13.2,260011.5
9,2017,AL,Alabama,Smokeless Tobacco Use (Adults),Frequency of Use,All Ages,All Races,All Grades,Some Days,59,69,7.6,54.1,83.9,5969
10,2017,AL,Alabama,Cigarette Use (Adults),Smoking Status,All Ages,All Races,All Grades,Current,2598,23.4,1.2,21.1,25.7,259823.4
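
In the preview above, each `count` value equals the text of `sample_size` followed by the text of `data_value` (for example, `2413` and `6.5` give `24136.5`). That is consistent with `+` being applied to string columns. The sketch below reproduces the effect on two rows from the preview and contrasts it with a numeric calculation. The formula `sample_size * data_value / 100` is an assumption about what an estimated respondent count might look like, not something stated in the source.

```python
import pandas as pd

# Two rows from the preview, kept as strings like the object-typed columns.
toy = pd.DataFrame({"sample_size": ["2413", "59"], "data_value": ["6.5", "69"]})

# On object (string) columns, + concatenates text.
concatenated = toy["sample_size"] + toy["data_value"]

# After conversion, arithmetic behaves numerically.
n = pd.to_numeric(toy["sample_size"])
pct = pd.to_numeric(toy["data_value"])
estimated_count = n * pct / 100  # assumed formula, for illustration only
```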
