<h1><center>INFO 370 Final Project Resource: Predicting Global Dietary Risks</center></h1>
<h2><center>Brian Luu, Sherry Gao, Youta Ishii, Zak Zheng</center></h2>

## Purpose

The purpose of this project is to understand how dietary habits in different world regions influence the death rates in those regions. We examine data on dietary risks and their associated death rates, analyze it, and fit regression models to predict the future prevalence of each dietary risk and its effect on the death rate in the respective region.

## Dataset

Below is a general overview of the dataset we used for our analysis:
    
Context: Risk

Location: 8 regions
    (Global, East Asia & Pacific, Europe & Central Asia, 
     Latin America & Caribbean, Middle East & North Africa,
     North America, South Asia, Sub-Saharan Africa)

Risk: 18 in total

Alcohol Use
Iron Deficiency
Vitamin A Deficiency
Zinc Deficiency
Diet Low in Fruits
Diet Low in Vegetables 
Diet Low in Whole Grains
Diet Low in Nuts and Seeds
Diet Low in Milk
Diet High in Red Meat
Diet High in Processed Meat
Diet High in Sugar-Sweetened Beverages
Diet Low in Fiber
Diet Suboptimal in Calcium
Diet Low in Seafood Omega 3 Fatty Acids
Diet Low in Polyunsaturated Fatty Acids
Diet High in Trans Fatty Acids
Diet High in Sodium 

Age: All

Sex: Both (Female, Male)

Year: 1990, 1995, 2000, 2005, 2010, 2015

Measure: Death Rates

Metric Used: Percentage


We obtained our dataset from GHDx: 
http://ghdx.healthdata.org/gbd-results-tool/

The column glossary can be found at:
http://www.healthdata.org/terms-defined

## Insights

After conducting our analysis, we believe our resource is able to compute the future prevalence of these diet-related risks with significant accuracy. We believe our analysis and predictions can be used by world health organizations, as well as government officials, to anticipate overall dietary trends in their respective countries years ahead, and to begin taking action to mitigate or even eradicate the problem as a whole.

Users of this resource can interactively pull up analytical information and model predictions for their respective regions. With a prediction in hand, they can then conduct further country- or region-specific research to understand the core of the problem. We hope this acts as a catalyst for solving dietary risks and their causes all over the world.

## Set up

In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
import seaborn as sns
import statsmodels.formula.api as smf # linear modeling
import statsmodels.api as sm

# Read in data
import pandas as pd
df = pd.read_csv('./final.csv')
# Import the module so that we can display tables when printing dataframes
from IPython.display import display, HTML

**We will be taking in data from the following years to train our model**

In [4]:
df_years = df.year.unique()
print(df_years)

[1990 1995 2000 2010 2005 2015]


**We will be taking data that are collected from the following 8 regions**

In [5]:
df_locations = df.location_name.unique()
print(df_locations)

['Global' 'Latin America & Caribbean - WB' 'Europe & Central Asia - WB'
 'North America' 'South Asia - WB' 'East Asia & Pacific - WB'
 'Sub-Saharan Africa - WB' 'Middle East & North Africa - WB']


In [6]:
# Drop the categories that are not among the 18 risks listed above
df = df[df.rei_name != 'Low physical activity']
df = df[df.rei_name != 'Dietary risks']

# Create a dataframe for each risk
alcohol_use = df.loc[df['rei_name'] == 'Alcohol use']
iron_deficiency = df.loc[df['rei_name'] == 'Iron deficiency']
vitamin_a_def = df.loc[df['rei_name'] == 'Vitamin A deficiency']
zinc_def = df.loc[df['rei_name'] == 'Zinc deficiency']
low_fruits = df.loc[df['rei_name'] == 'Diet low in fruits']
low_vegetables = df.loc[df['rei_name'] == 'Diet low in vegetables']
low_wholegrains = df.loc[df['rei_name'] == 'Diet low in whole grains']
low_nutsseeds = df.loc[df['rei_name'] == 'Diet low in nuts and seeds']
low_milk = df.loc[df['rei_name'] == 'Diet low in milk']
high_redmeat = df.loc[df['rei_name'] == 'Diet high in red meat']
high_processedmeat = df.loc[df['rei_name'] == 'Diet high in processed meat']
high_sugarbev = df.loc[df['rei_name'] == 'Diet high in sugar-sweetened beverage']
low_fiber = df.loc[df['rei_name'] == 'Diet low in fiber']
suboptimal_calcium = df.loc[df['rei_name'] == 'Diet suboptimal in calcium']
low_omega3 = df.loc[df['rei_name'] == 'Diet low in seafood omega-3 fatty acids']
low_polyunsaturated = df.loc[df['rei_name'] == 'Diet low in polyunsaturated fatty acids']
high_transfattyacid = df.loc[df['rei_name'] == 'Diet high in trans fatty acids']
high_sodium = df.loc[df['rei_name'] == 'Diet high in sodium']

## Method

For each of our analyses, we use a linear regression model of death rate against year to produce our statistical summary and predictions.
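As a rough illustration of what such a model does, the sketch below fits a least-squares line $\text{val} = a + b \cdot \text{year}$ in plain Python and extrapolates it to 2020. The helper `fit_line` and the data values are hypothetical and chosen only for illustration; they are not taken from `final.csv`, and the notebook itself relies on `statsmodels` and `seaborn` for the actual fits.

```python
# Minimal sketch of simple linear regression (val ~ year) in plain Python.
# The data points are made up for illustration; they are NOT from final.csv.

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

years = [1990, 1995, 2000, 2005, 2010, 2015]
vals = [0.050, 0.048, 0.046, 0.044, 0.042, 0.040]  # hypothetical death-rate values

intercept, slope = fit_line(years, vals)
prediction_2020 = intercept + slope * 2020  # extrapolate five years past the data

print(f"slope per year: {slope:.6f}")
print(f"predicted value for 2020: {prediction_2020:.4f}")
```

Because 2020 lies outside the observed years, the prediction assumes the linear trend continues unchanged.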

<h3>Create a chart for a risk and see its prominence in each of the world's regions and prediction for 2020:</h3>

In [7]:
from ipywidgets import widgets
from IPython.display import display
from IPython.display import clear_output

columns = df.rei_name.unique().tolist()
selection = widgets.Dropdown(description = 'Select a risk')
selection.options = columns
display(selection)

def on_button_clicked(b):
    clear_output()
    p = sns.lmplot(x="year", y="val", hue="location_name", data=df[df['rei_name']==selection.value]);
    plt.xlabel('Year')
    plt.ylabel('Death rate (%)')
    plt.title(selection.value + ' and Death Rate: All Regions')
    
button = widgets.Button(description='Create graph')
display(button)

button.on_click(on_button_clicked)

<h3>Create a chart for a region and see each risk, its prominence in the region, and its prediction for 2020:</h3>

In [8]:
columns2 = df.location_name.unique().tolist()
selection2 = widgets.Dropdown(description = 'Select a location')
selection2.options = columns2
display(selection2)

def on_button2_clicked(b):
    clear_output()
    p2 = sns.lmplot(x="year", y="val", hue="rei_name", data=df[df['location_name']==selection2.value]);
    plt.xlabel('Year')
    plt.ylabel('Death rate (%)')
    plt.title(selection2.value + ' and Death Rate: All Risks')
    
button2 = widgets.Button(description='Create graph')
display(button2)

button2.on_click(on_button2_clicked)

<h3>Our analysis will produce the risk's predicted influence on deaths in the year 2020, as well as a summary table of its statistics:</h3>

In [9]:
selection4 = widgets.Dropdown(description = 'Select a risk')
selection4.options = columns
display(selection4)

selection5 = widgets.Dropdown(description = 'Select a location')
selection5.options = columns2
display(selection5)

def on_button4_clicked(b):
    clear_output()
    print(selection4.value + " in " + selection5.value +  ": Summary")
    data1 = df[df['rei_name']==selection4.value]
    data2 = data1[data1['location_name']==selection5.value]
    lm = smf.ols(formula='val ~ year', data=data2).fit()
    print(lm.summary())
    print()
    
    print('Predictions for ' + selection4.value + ' in each region for year 2020')
    for location in df['location_name'].unique():
        # Keep only the rows for this region and the selected risk
        tempdf = df[(df.location_name == location) & (df.rei_name == selection4.value)]
        # Fit val ~ year with an intercept, matching the summary model above
        results = smf.ols(formula='val ~ year', data=tempdf).fit()
        prediction = results.predict(pd.DataFrame({'year': [2020]}))
        print(location + ', ' + str(prediction.iloc[0]))
    
button4 = widgets.Button(description='Predict')
display(button4)

button4.on_click(on_button4_clicked)
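One detail worth keeping in mind when reading the 2020 predictions: the formula interface (`val ~ year`) includes an intercept automatically, while the array interface `sm.OLS(y, X)` fits a line through the origin unless a constant column is supplied (for example with `sm.add_constant`). The sketch below uses plain Python and made-up numbers, not values from `final.csv`, to show how far the two extrapolations can diverge.

```python
# Sketch: effect of omitting the intercept when extrapolating to 2020.
# Values are made up for illustration; they are NOT from final.csv.

years = [1990, 1995, 2000, 2005, 2010, 2015]
vals = [0.050, 0.048, 0.046, 0.044, 0.042, 0.040]  # hypothetical, steadily declining

n = len(years)
mean_x = sum(years) / n
mean_y = sum(vals) / n

# Model with an intercept: val = a + b * year
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, vals)) / sum(
    (x - mean_x) ** 2 for x in years
)
a = mean_y - b * mean_x
with_intercept_2020 = a + b * 2020

# Model through the origin: val = c * year (no constant term)
c = sum(x * y for x, y in zip(years, vals)) / sum(x * x for x in years)
through_origin_2020 = c * 2020

print(f"with intercept: {with_intercept_2020:.4f}")
print(f"through origin: {through_origin_2020:.4f}")
```

With these illustrative numbers, the model with an intercept continues the decline below the 2015 value, whereas the through-origin model predicts a value above it, so the choice of interface changes the direction of the predicted trend.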