# Health Burden

## Set up

In [106]:
# Read in data
import matplotlib
matplotlib.style.use('ggplot')
import pandas
data = pandas.read_csv('./data/prepped/risk-data.csv')


## Data Structure
To get a basic sense of your dataset, check the following:

- How large is the dataset (rows, columns)?
- What are the variables present in the dataset?
- What is the data type of each variable?

In [101]:
# Dataset dimensions
num_columns = len(data.columns)
num_rows = len(data)

# Column names and the data type of each
print("\nThese are the columns and their data types: \n" + str(data.dtypes))

print("\n Number of Rows: " + str(num_rows) + "\n Number of Columns: " + str(num_columns) + "\n")

# Summary statistics for the numeric columns
print(data.describe())




These are the columns and their data types: 
country          object
country.code     object
super.region     object
region           object
sex              object
age              object
pop             float64
alcohol.use     float64
drug.use        float64
high.meat       float64
low.exercise    float64
smoking         float64
dtype: object

 Number of Rows: 1950
 Number of Columns: 12

                pop  alcohol.use     drug.use    high.meat  low.exercise  \
count  1.950000e+03  1950.000000  1950.000000  1170.000000   1170.000000   
mean   3.777708e+06    47.146545     7.128357     2.594225    119.552443   
std    1.913391e+07    86.009150    14.197640     4.043355    153.239610   
min    3.563755e+02  -106.232008     0.000038     0.000176      0.308731   
25%    1.000900e+05     0.273705     0.030068     0.154818      4.033243   
50%    5.054975e+05     6.784878     1.917572     0.817486     40.743931   
75%    2.071966e+06    54.987255     8.528416     3.286753    209.803827 

In [58]:
data = data.sort_values("age")
data_no_na = data.dropna()
print(data_no_na.tail(5))

       country country.code                  super.region  \
1800      Chad          TCD            Sub-Saharan Africa   
1902   Nigeria          NGA            Sub-Saharan Africa   
1675  Tanzania          TZA            Sub-Saharan Africa   
1415    Turkey          TUR  North Africa and Middle East   
1676  Tanzania          TZA            Sub-Saharan Africa   

                            region     sex        age        pop  alcohol.use  \
1800    Western Sub-Saharan Africa    male  70+ years    91535.0   282.953866   
1902    Western Sub-Saharan Africa  female  70+ years  1386263.0   203.948491   
1675    Eastern Sub-Saharan Africa    male  70+ years   477080.0   399.503112   
1415  North Africa and Middle East  female  70+ years  2370019.0     8.360725   
1676    Eastern Sub-Saharan Africa  female  70+ years   557176.0   250.831526   

       drug.use  high.meat  low.exercise     smoking  
1800  27.271814   3.529395    226.732506  656.300923  
1902   5.562312   0.259720    155.36


## Univariate Analysis
For each variable of interest, answer the following questions. As you do so, begin making a list of further questions you would like to investigate:

- What does the distribution of each (risk factor) variable look like?
- Is any variable ever missing (and if so, why)?
- What are the basic summary statistics (mean, median, standard deviation) of each variable, and what is its range (min/max)?
- What do you find surprising?
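The `describe()` output in the Data Structure section reports 1170 non-null values for `high.meat` and `low.exercise` against 1950 rows, so the missing-data question is worth checking directly. The following is a minimal sketch on a made-up stand-in frame (`toy` and its values are hypothetical, not taken from `risk-data.csv`); it counts missing values per column and then per age group.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the risk data; the values are invented.
toy = pd.DataFrame({
    "age": ["15-49 years", "15-49 years", "50-69 years", "70+ years"],
    "smoking": [1.2, 3.4, 80.0, 250.0],
    "high.meat": [np.nan, np.nan, 2.5, 9.1],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Missing values of one column broken down by age group, to see
# whether the gaps are concentrated in particular groups.
missing_by_age = toy["high.meat"].isna().groupby(toy["age"]).sum()
print(missing_by_age)
```

Applied to the real data, the same two calls on `data` would show whether the gaps follow a pattern (for example, a whole age group lacking a risk factor) or are scattered.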

In [77]:

# Median and standard deviation of the numeric columns
print(data_no_na.median(numeric_only=True))
print()
print(data_no_na.std(numeric_only=True))
print()
print(data_no_na.min())
print()
print(data_no_na.max())
print()

print(data_no_na[data_no_na.smoking == data_no_na.smoking.max()])
print()
print()

# Rename the column so it can be accessed as an attribute
data_good = data_no_na.rename(columns={"alcohol.use": "alcohol"})
print(data_good[data_good.alcohol == data_good.alcohol.min()])

pop             528688.000000
alcohol.use         41.167749
drug.use             6.578756
high.meat            0.817486
low.exercise        40.743931
smoking            105.816725
dtype: float64

pop             2.358007e+07
alcohol.use     9.949582e+01
drug.use        1.677325e+01
high.meat       4.043355e+00
low.exercise    1.532396e+02
smoking         4.371672e+02
dtype: float64

country                                              Afghanistan
country.code                                                 AFG
super.region    Central Europe, Eastern Europe, and Central Asia
region                                      Andean Latin America
sex                                                       female
age                                                  15-49 years
pop                                                      356.376
alcohol.use                                             -106.232
drug.use                                                0.244291
high.meat                    

The most surprising finding in this dataset is that some countries have negative values for deaths caused by alcohol: the minimum of `alcohol.use` is -106.23.
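To follow up on the negative values, one option is to isolate the affected rows and see which groups they belong to. This is only a sketch on a hypothetical frame (`toy`, with invented countries and values); on the real data the same filter would be applied to `data["alcohol.use"]`.

```python
import pandas as pd

# Hypothetical stand-in rows; the countries and values are invented.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "C"],
    "age": ["15-49 years", "70+ years", "15-49 years", "50-69 years"],
    "alcohol.use": [-5.0, 12.3, -0.7, 0.4],
})

# Keep only the rows where the alcohol value is below zero.
negative = toy[toy["alcohol.use"] < 0]
print(negative)

# How many rows are affected, and in which age groups they fall.
print(len(negative), "of", len(toy), "rows are negative")
negative_by_age = negative.groupby("age").size()
print(negative_by_age)
```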

## Univariate analysis (by age)
In this section, you should investigate how each (risk-variable) varies by **age group**. More specifically, consider if the distribution of each variable of interest (smoking, alcohol use, etc.) is consistent across age-groups.
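One way to compare distributions across age groups is to compute the same summary statistics within each group. The sketch below uses a hypothetical frame (`toy`, with invented values) and a single risk column; it is one possible approach, not the only one.

```python
import pandas as pd

# Hypothetical stand-in data; the values are invented.
toy = pd.DataFrame({
    "age": ["15-49 years"] * 3 + ["70+ years"] * 3,
    "smoking": [1.0, 2.0, 3.0, 100.0, 200.0, 600.0],
})

# Mean, median, spread, and range of one risk variable per age group.
summary = toy.groupby("age")["smoking"].agg(["mean", "median", "std", "min", "max"])
print(summary)
```

Comparing the mean and median within a group gives a quick hint about skew; in the made-up 70+ group the mean sits well above the median.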

In [107]:
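# Total of each numeric column within each age group
group_a = data_good.groupby("age")
print(group_a.sum(numeric_only=True))
# One histogram per age group
group_a.plot.hist()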


                      pop       alcohol     drug.use    high.meat  \
age                                                                 
15-49 years  3.805412e+09   7996.099455  2122.933940    64.450803   
50-69 years  1.252127e+09  31233.367952  4808.302068   828.110557   
70+ years    3.977926e+08  52423.149897  6877.779155  2142.682087   

              low.exercise        smoking  
age                                        
15-49 years    1313.470128    6578.443305  
50-69 years   20292.251868   82617.865699  
70+ years    118270.636600  261242.603097  


[Figure: histograms produced by `group_a.plot.hist()`, one per age group (15-49 years, 50-69 years, 70+ years)]

## Univariate analysis (by sex)
In this section, you should investigate how each (risk-variable) varies by **sex group**. More specifically, consider if the distribution of each variable of interest (smoking, alcohol use, etc.) is consistent across sex-groups. Depending on your procedure, you may need to **reshape your data**.
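As an illustration of the reshaping step, the sketch below pivots a long table so that each sex becomes its own column, which makes a side-by-side comparison straightforward. The frame `toy`, its values, and the derived column name `male_minus_female` are all hypothetical.

```python
import pandas as pd

# Hypothetical long-format data: one row per country, age group, and sex.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "age": ["70+ years"] * 4,
    "sex": ["female", "male", "female", "male"],
    "smoking": [10.0, 30.0, 20.0, 80.0],
})

# Wide format: one row per (country, age), one column per sex.
wide = toy.pivot_table(index=["country", "age"], columns="sex", values="smoking")

# Difference between the sexes within each country and age group.
wide["male_minus_female"] = wide["male"] - wide["female"]
print(wide)
```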

In [109]:
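# Total of each numeric column within each sex
group_s = data_good.groupby("sex")
print(group_s.sum(numeric_only=True))

# Pearson correlation between the variables within each sex
group_s.corr(method="pearson")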

                 pop       alcohol      drug.use    high.meat  low.exercise  \
sex                                                                           
female  2.730318e+09  25195.357637   2630.228479  1272.283027  64674.985348   
male    2.725013e+09  66457.259666  11178.786684  1762.960421  75201.373247   

              smoking  
sex                    
female   81514.064525  
male    268924.847577  


## Univariate analysis (by country)
In this section, you should investigate how each (risk-variable) varies by **country**. Given the number of countries present in the dataset, I suggest that you aggregate your data by region. In order to do this, you'll need to **convert death rates to deaths** using the `pop` column.
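The conversion can be sketched as follows. The notebook does not state the denominator of the death rates, so the value of `RATE_DENOMINATOR` below (deaths per 100,000 people) is an assumption that should be checked against the data documentation. The frame `toy`, its region names, and its values are hypothetical.

```python
import pandas as pd

# ASSUMPTION: rates are expressed per 100,000 people. Verify before use.
RATE_DENOMINATOR = 100_000

# Hypothetical stand-in data; regions and values are invented.
toy = pd.DataFrame({
    "region": ["North", "North", "South"],
    "pop": [200_000.0, 300_000.0, 500_000.0],
    "smoking": [50.0, 100.0, 10.0],
})

# Convert each row's rate to a death count using its population.
toy["smoking_deaths"] = toy["smoking"] * toy["pop"] / RATE_DENOMINATOR

# Deaths and population can be summed across rows; rates cannot.
by_region = toy.groupby("region")[["pop", "smoking_deaths"]].sum()

# Recompute a population-weighted rate for each region.
by_region["smoking_rate"] = (
    by_region["smoking_deaths"] / by_region["pop"] * RATE_DENOMINATOR
)
print(by_region)
```

Summing counts and then dividing by the summed population weights each country by its size, which a plain average of the country rates would not do.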

## Bivariate analysis
In this section, you should compare risk variables to one another to see how they co-vary. Use simple statistical tests (e.g., **correlation**) and visualizations as you see fit.
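A correlation matrix is one simple starting point. The sketch below computes pairwise Pearson correlations on a hypothetical frame (`toy`, with invented values chosen so that the relationships are obvious); on the real data the same call would be applied to the risk columns of `data`.

```python
import pandas as pd

# Hypothetical stand-in data; the values are invented.
toy = pd.DataFrame({
    "smoking": [1.0, 2.0, 3.0, 4.0],
    "alcohol.use": [2.0, 4.0, 6.0, 8.0],
    "drug.use": [4.0, 3.0, 2.0, 1.0],
})

# Pairwise Pearson correlation between all columns.
corr_matrix = toy.corr(method="pearson")
print(corr_matrix.round(2))

# Correlation for a single pair of variables.
pair_corr = toy["smoking"].corr(toy["drug.use"])
print(pair_corr)
```

A scatter plot of any pair of interest (for example via `DataFrame.plot.scatter`) is a useful check alongside the coefficient, since a single number can hide non-linear patterns.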

In [6]:
# Code goes here