# Analysis of Tobacco sales across States in the US
by Group 14 (Zilin Zhang, Jessica Chen, Daniel Jang, Isabel Adelhardt)

## Hypothesis: Does median income per state have an impact on sales?
1. Sales per year in each state from 2013 to 2019
2. Median income in each state per year from 2013 to 2019
   2.1 If a state's median income rises or falls between years, does that change relate to the change in its sales?
3. Regression plots for sales and median income in each state per year
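
Items 2.1 and 3 can be sketched as below. This is a minimal illustration under stated assumptions, not the analysis itself: the frame `toy`, its column names (`Year`, `State`, `Sales`, `MedianIncome`), the state codes `AA` and `BB`, and every value in it are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy values, made up for illustration only (not taken from either dataset).
toy = pd.DataFrame({
    "Year": [2013, 2014, 2015, 2013, 2014, 2015],
    "State": ["AA", "AA", "AA", "BB", "BB", "BB"],
    "Sales": [60.0, 58.0, 55.0, 30.0, 31.0, 29.0],
    "MedianIncome": [40000, 41000, 43000, 60000, 59000, 62000],
})

# Order rows so that diff() compares consecutive years within a state.
toy = toy.sort_values(["State", "Year"])

# Year-over-year change per state; the first year of each state is NaN.
toy["SalesDiff"] = toy.groupby("State")["Sales"].diff()
toy["IncomeDiff"] = toy.groupby("State")["MedianIncome"].diff()

# Keep only rows where both changes are defined.
changes = toy.dropna(subset=["SalesDiff", "IncomeDiff"])

# Least-squares line: SalesDiff ~ slope * IncomeDiff + intercept.
slope, intercept = np.polyfit(changes["IncomeDiff"], changes["SalesDiff"], deg=1)

# Pearson correlation between the two year-over-year changes.
corr = changes["IncomeDiff"].corr(changes["SalesDiff"])
```

A negative `slope` here would mean that years with larger income gains tend to show larger drops in sales. With real data, a scatter plot of `IncomeDiff` against `SalesDiff` with the fitted line overlaid would serve as the regression plot.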

In [3]:
import pandas as pd

## Step 1: sales in each state per year from 2013-2019

In [21]:

sales_DATA = pd.read_csv("data/U.S._Chronic_Disease_Indicators__Tobacco.csv", low_memory=False)

# sales for each state in years from 2013 to 2019
# filter question
states_sales = sales_DATA[sales_DATA["Question"] == "Sale of cigarette packs"]
# filter columns
states_sales = states_sales[['YearStart', 'LocationAbbr', 'Question', 'DataValueUnit', 'DataValue']]
states_sales = states_sales.groupby(['YearStart', 'LocationAbbr']).sum().reset_index()
# drop rows whose summed DataValue is 0; this is how GU, PR, US, and VI (not actual states) are removed
states_sales = states_sales.loc[states_sales['DataValue'] != 0.0]

states_sales

|  | YearStart | LocationAbbr | Question | DataValueUnit | DataValue |
|---|---|---|---|---|---|
| 0 | 2013 | AK | Sale of cigarette packs | pack sales per capita | 39 |
| 1 | 2013 | AL | Sale of cigarette packs | pack sales per capita | 64.6 |
| 2 | 2013 | AR | Sale of cigarette packs | pack sales per capita | 57.5 |
| 3 | 2013 | AZ | Sale of cigarette packs | pack sales per capita | 24.4 |
| 4 | 2013 | CA | Sale of cigarette packs | pack sales per capita | 23.9 |
| ... | ... | ... | ... | ... | ... |
| 325 | 2019 | VT | Sale of cigarette packs | pack sales per capita | 31.5 |
| 326 | 2019 | WA | Sale of cigarette packs | pack sales per capita | 15 |
| 327 | 2019 | WI | Sale of cigarette packs | pack sales per capita | 35.4 |
| 328 | 2019 | WV | Sale of cigarette packs | pack sales per capita | 75 |
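
The cell above removes GU, PR, US, and VI indirectly, by filtering on `DataValue`. A more explicit alternative is to filter on `LocationAbbr` itself. The sketch below assumes a small stand-in frame `toy_sales` with placeholder values; only the column names follow the notebook.

```python
import pandas as pd

# Stand-in rows with placeholder values; only the column names follow the notebook.
toy_sales = pd.DataFrame({
    "YearStart": [2013, 2013, 2013, 2013],
    "LocationAbbr": ["AK", "GU", "US", "AL"],
    "DataValue": [1.0, 0.0, 2.0, 3.0],
})

# Abbreviations the notebook treats as "not actual states".
non_states = ["GU", "PR", "US", "VI"]

# Keep every row whose abbreviation is not in that list.
states_only = toy_sales[~toy_sales["LocationAbbr"].isin(non_states)].reset_index(drop=True)
```

Filtering by abbreviation does not depend on those rows happening to sum to zero, and it would not drop a real state that reported a value of 0.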


## Step 2: Median income in each state per year from 2013 - 2019

The median income data comes from the U.S. Bureau of the Census, Current Population Survey, Annual Social and Economic Supplements. For information on confidentiality protection, sampling error, nonsampling error, and definitions, see <www2.census.gov/programs-surveys/cps/techdocs/cpsmar19.pdf>.



We use the 2017 Median income column instead of 2017 (40) because 2017 (40) represents data from a new data processing system. The 2017 (40) data should be used for analysis of median income after 2017, but our analysis relates mostly to the years before 2017.

We use 2013 (38) because 2013 (39) represents data from individuals who received new income questions. These new income questions were not used in the following years, so we disregard that data.

In [22]:
median_DATA = pd.read_csv('data/median_income.csv')
cols_drop = median_DATA.columns[median_DATA.columns.str.contains("Standard")]
median_income = median_DATA.drop(columns=cols_drop).set_index("Location")

# drop 2017 (40) and 2013 (39), along with the first two columns, by position
median_income = median_income.drop(median_income.columns[[0, 1, 4, 9]], axis = 1)
median_income.head()

| Location | 2019 Median income | 2018 Median income | 2017 Median income | 2016 Median income | 2015 Median income | 2014 Median income | 2013 (38) Median income |
|---|---|---|---|---|---|---|---|
| United States | 68703 | 63179 | 61372 | 59039 | 56516 | 53657 | 51939 |
| Alabama | 56200 | 49936 | 51113 | 47221 | 44509 | 42278 | 41381 |
| Alaska | 78394 | 68734 | 72231 | 75723 | 75112 | 67629 | 61137 |
| Arizona | 70674 | 62283 | 61125 | 57100 | 52248 | 49254 | 50602 |
| Arkansas | 54539 | 49781 | 48829 | 45907 | 42798 | 44922 | 39919 |
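
The positional drop above depends on the column order of the CSV. The sketch below shows one alternative: dropping by label, then reshaping to one row per location and year. It makes several assumptions. The labels `2017 (40) Median income` and `2013 (39) Median income` are guessed from the naming pattern of the surviving columns, the zeros under them are placeholders, and the names `wide`, `kept`, `long`, `Year`, and `MedianIncome` are hypothetical. The other values are copied from the output above.

```python
import pandas as pd

# Two rows copied from the output above; the two unwanted columns hold
# placeholder zeros, and their labels are an assumption about the CSV.
wide = pd.DataFrame(
    {
        "2017 (40) Median income": [0, 0],
        "2017 Median income": [51113, 72231],
        "2013 (39) Median income": [0, 0],
        "2013 (38) Median income": [41381, 61137],
    },
    index=pd.Index(["Alabama", "Alaska"], name="Location"),
)

# Drop by label, so a change in column order cannot remove the wrong year.
unwanted = ["2017 (40) Median income", "2013 (39) Median income"]
kept = wide.drop(columns=unwanted)

# Reduce each label to its leading four-digit year, e.g. "2013 (38) ..." -> 2013.
kept.columns = kept.columns.str.extract(r"^(\d{4})", expand=False).astype(int)

# Long format: one row per (Location, Year).
long = kept.reset_index().melt(
    id_vars="Location", var_name="Year", value_name="MedianIncome"
)
long["Year"] = long["Year"].astype(int)
long = long.sort_values(["Location", "Year"]).reset_index(drop=True)
```

The long layout has the same shape as `states_sales`, which has one row per year and location. Note that `states_sales` identifies states by abbreviation (`LocationAbbr`) while this table uses full names, so joining the two would also need a name-to-abbreviation mapping.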
