# Health Burden

## Set up

In [46]:
# Read in data
import pandas
data = pandas.read_csv('./data/prepped/risk-data.csv')

## Data Structure
To get a basic sense of your dataset, check the following:

- How large is the dataset (rows, columns)?
- What are the variables present in the dataset?
- What is the data type of each variable?

In [47]:
# dataset size
shape = data.shape
print("rows=%d, cols=%d" % shape)
print()

# variables present in dataset
names = data.columns.values
print('Column names:')
print(names)
print()

# data types of variables
dtypes = data.dtypes
print('Data types:')
print(dtypes)

rows=1950, cols=12

Column names:
['country' 'country.code' 'super.region' 'region' 'sex' 'age' 'pop'
 'alcohol.use' 'drug.use' 'high.meat' 'low.exercise' 'smoking']

Data types:
country          object
country.code     object
super.region     object
region           object
sex              object
age              object
pop             float64
alcohol.use     float64
drug.use        float64
high.meat       float64
low.exercise    float64
smoking         float64
dtype: object



## Univariate Analysis
For each variable of interest, answer the following questions. As you do so, begin making a list of further questions you would like to investigate:

- What does the distribution of each (risk factor) variable look like?
- Is any variable ever missing (and if so, why)?
- What are the basic summary statistics (mean, median, standard deviation) of each variable, and what is its range (min/max)?
- What do you find surprising?

In [62]:
# import plotly to graph distributions
from plotly.offline import plot
import plotly.graph_objs as go

# one histogram trace per risk factor
alc_data = [go.Histogram(
    x=data['alcohol.use'],
)]

drug_data = [go.Histogram(
    x=data['drug.use'],
)]

meat_data = [go.Histogram(
    x=data['high.meat'],
)]

exc_data = [go.Histogram(
    x=data['low.exercise'],
)]

smoke_data = [go.Histogram(
    x=data['smoking'],
)]

# plot data (each call writes to the default temp-plot.html)
plot(alc_data)
plot(drug_data)
plot(meat_data)
plot(exc_data)
plot(smoke_data)

'file:///Users/amberkim/Google Drive (uw)/INFO 370/eda/health-burden/temp-plot.html'

In [70]:
# rows where high.meat is missing
missing = data['high.meat'].isnull()
data[missing]

Unnamed: 0,country,country.code,super.region,region,sex,age,pop,alcohol.use,drug.use,high.meat,low.exercise,smoking
2,China,CHN,"Southeast Asia, East Asia, and Oceania",East Asia,male,Under 5,4.458679e+07,0.236908,0.120827,,,
3,China,CHN,"Southeast Asia, East Asia, and Oceania",East Asia,male,5-14 years,8.299538e+07,1.292633,0.038498,,,
6,China,CHN,"Southeast Asia, East Asia, and Oceania",East Asia,female,5-14 years,7.148107e+07,0.868670,0.039037,,,
8,China,CHN,"Southeast Asia, East Asia, and Oceania",East Asia,female,Under 5,3.854704e+07,0.183050,0.126190,,,
10,North Korea,PRK,"Southeast Asia, East Asia, and Oceania",East Asia,male,Under 5,8.955102e+05,0.186889,0.101431,,,
11,North Korea,PRK,"Southeast Asia, East Asia, and Oceania",East Asia,female,Under 5,8.553315e+05,0.195622,0.109240,,,
15,North Korea,PRK,"Southeast Asia, East Asia, and Oceania",East Asia,male,5-14 years,1.826308e+06,1.078984,0.029139,,,
18,North Korea,PRK,"Southeast Asia, East Asia, and Oceania",East Asia,female,5-14 years,1.751028e+06,0.709501,0.028816,,,
23,Taiwan,TWN,"Southeast Asia, East Asia, and Oceania",East Asia,male,5-14 years,1.278556e+06,0.605018,0.006473,,,
26,Taiwan,TWN,"Southeast Asia, East Asia, and Oceania",East Asia,female,5-14 years,1.167340e+06,0.424741,0.005962,,,
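
The rows shown above all belong to the "Under 5" and "5-14 years" groups, so it may help to count missing values per column and per age group rather than inspect rows one at a time. The following is a minimal sketch of that idea. The helper name `missing_summary` and the toy frame (including the "15-49 years" label and all of its values) are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

def missing_summary(df, group_col="age"):
    """Count missing values per column, overall and within each group."""
    overall = df.isnull().sum()
    value_cols = df.columns.drop(group_col)
    # Boolean mask of missing cells, summed within each group
    by_group = df[value_cols].isnull().groupby(df[group_col]).sum()
    return overall, by_group

# Toy frame with made-up values (not taken from risk-data.csv)
toy = pd.DataFrame({
    "age": ["Under 5", "5-14 years", "15-49 years", "15-49 years"],
    "alcohol.use": [0.2, 1.1, 50.0, 60.0],
    "high.meat": [None, None, 2.5, 3.0],
})

overall, by_group = missing_summary(toy)
print(overall)
print(by_group)
```

On the real dataset, `missing_summary(data)` would apply the same counts to every column, which should show whether the gaps are confined to particular age groups.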


In [76]:
# summary statistics for each numeric variable
print(data.describe())
print()
print('medians: ')
# numeric_only skips the text columns (country, region, sex, age)
print(data.median(numeric_only=True))

                pop  alcohol.use     drug.use    high.meat  low.exercise  \
count  1.950000e+03  1950.000000  1950.000000  1170.000000   1170.000000   
mean   3.777708e+06    47.146545     7.128357     2.594225    119.552443   
std    1.913391e+07    86.009150    14.197640     4.043355    153.239610   
min    3.563755e+02  -106.232008     0.000038     0.000176      0.308731   
25%    1.000900e+05     0.273705     0.030068          NaN           NaN   
50%    5.054975e+05     6.784878     1.917572          NaN           NaN   
75%    2.071966e+06    54.987255     8.528416          NaN           NaN   
max    3.901690e+08   662.914151   314.625888    36.087746    844.249502   

           smoking  
count  1170.000000  
mean    299.520438  
std     437.167168  
min       0.532166  
25%            NaN  
50%            NaN  
75%            NaN  
max    2691.239677  

medians: 
pop             505497.500000
alcohol.use          6.784878
drug.use             1.917572
high.meat            0.81


Invalid value encountered in percentile



## Univariate analysis (by age)
In this section, investigate how each risk variable varies by **age group**. More specifically, consider whether the distribution of each variable of interest (smoking, alcohol use, etc.) is consistent across age groups.

In [3]:
# Code goes here
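
One possible starting point is to compute the same summary statistics within each age group and compare them side by side. This is a sketch only: the helper `summarize_by` is a hypothetical name, and the toy frame holds made-up values standing in for the real `data`.

```python
import pandas as pd

RISK_COLS = ["alcohol.use", "drug.use", "high.meat", "low.exercise", "smoking"]

def summarize_by(df, group_col, value_cols):
    """Per-group mean, median, std, min and max for each value column."""
    cols = [c for c in value_cols if c in df.columns]
    return df.groupby(group_col)[cols].agg(["mean", "median", "std", "min", "max"])

# Toy frame with made-up values (not taken from risk-data.csv)
toy = pd.DataFrame({
    "age": ["Under 5", "Under 5", "5-14 years", "5-14 years"],
    "alcohol.use": [0.2, 0.4, 1.0, 3.0],
    "drug.use": [0.1, 0.3, 0.02, 0.04],
})

summary = summarize_by(toy, "age", RISK_COLS)
print(summary)
```

With the notebook's frame, the call would be `summarize_by(data, "age", RISK_COLS)`. Large differences in mean or spread between groups suggest the distributions are not consistent across ages.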

## Univariate analysis (by sex)
In this section, investigate how each risk variable varies by **sex**. More specifically, consider whether the distribution of each variable of interest (smoking, alcohol use, etc.) is consistent across sexes. Depending on your procedure, you may need to **reshape your data**.

In [4]:
# Code goes here
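
One way to reshape the data is to pivot it so that each country and age group has a `male` and a `female` column for a given risk variable, which makes the difference easy to compute. The sketch below assumes the `sex` column holds exactly the values `male` and `female`, as in the rows printed earlier. The helper `wide_by_sex` and the toy values are hypothetical.

```python
import pandas as pd

def wide_by_sex(df, value_col, id_cols=("country", "age")):
    """Pivot one risk variable to male/female columns and add their difference."""
    wide = df.pivot_table(index=list(id_cols), columns="sex", values=value_col)
    wide["male_minus_female"] = wide["male"] - wide["female"]
    return wide

# Toy frame with made-up values (not taken from risk-data.csv)
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "age": ["Under 5"] * 4,
    "sex": ["male", "female", "male", "female"],
    "alcohol.use": [0.5, 0.25, 2.0, 1.0],
})

wide = wide_by_sex(toy, "alcohol.use")
print(wide)
```

`pivot_table` averages duplicate entries by default, so this assumes one row per country, age group, and sex.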

## Univariate analysis (by country)
In this section, investigate how each risk variable varies by **country**. Given the number of countries present in the dataset, I suggest that you aggregate your data by region. To do this, you'll need to **convert death rates to deaths** using the `pop` column, because rates from populations of different sizes cannot simply be averaged.

In [5]:
# Code goes here
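
The conversion can be sketched as follows: multiply each rate by its population to get deaths, sum deaths and population within each region, then divide to recover a regional rate. The notebook does not state the unit of the rate columns, so the `per=100_000` default (deaths per 100,000 people) is an assumption to check against the data documentation. The helper `regional_rates` and the toy values are hypothetical.

```python
import pandas as pd

def regional_rates(df, value_col, group_col="region", per=100_000):
    """Aggregate a rate column to groups via deaths = rate * pop / per.

    `per` is an ASSUMED rate denominator; the source does not state the unit.
    """
    tmp = df[[group_col, "pop", value_col]].dropna()
    tmp = tmp.assign(deaths=tmp[value_col] * tmp["pop"] / per)
    totals = tmp.groupby(group_col)[["deaths", "pop"]].sum()
    # Convert summed deaths back to a population-weighted rate
    totals["rate"] = totals["deaths"] / totals["pop"] * per
    return totals

# Toy frame with made-up values (not taken from risk-data.csv)
toy = pd.DataFrame({
    "region": ["X", "X", "Y"],
    "pop": [1000.0, 3000.0, 2000.0],
    "smoking": [10.0, 20.0, 5.0],
})

totals = regional_rates(toy, "smoking")
print(totals)
```

Rows with a missing rate are dropped before summing so that their population does not dilute the regional rate.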

## Bivariate analysis
In this section, compare risk variables to one another to see how they co-vary. Use simple statistical measures (e.g., **correlation**) and visualization as you see fit.

In [6]:
# Code goes here
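
A pairwise correlation matrix is one simple way to begin. The sketch below uses the pandas `DataFrame.corr` method, which skips missing values pair by pair. The helper `risk_correlations` is a hypothetical name, and the toy columns are made-up values chosen so the result is easy to verify.

```python
import pandas as pd

def risk_correlations(df, cols, method="pearson"):
    """Pairwise correlation matrix for the given columns."""
    return df[list(cols)].corr(method=method)

# Toy frame with made-up values (not taken from risk-data.csv)
toy = pd.DataFrame({
    "alcohol.use": [1.0, 2.0, 3.0, 4.0],
    "drug.use": [2.0, 4.0, 6.0, 8.0],   # rises with alcohol.use
    "smoking": [4.0, 3.0, 2.0, 1.0],    # falls as alcohol.use rises
})

corr = risk_correlations(toy, ["alcohol.use", "drug.use", "smoking"])
print(corr)
```

Because the risk variables in this dataset appear heavily skewed (compare the means and medians above), it may be worth also trying `method="spearman"`, which is rank-based.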