# Food prices in the third world

Robert-Jan Korteschiel (10399143)  
Robert Houten  
Sander Kohnstamm (10715363)  
Joost de Wildt (12173002)  


## Preliminary research

<br>
<div style="display: flex; width: 100%; justify-content: space-around;">
    <div style="width: 45%">
        <img src="food_viz.png" style="display: block; width: 100%; height: auto;" alt="GIS visualisation">
    </div>
    <div style="width: 45%">
        <img src="patterns.jpg" style="display: block; width: 100%; height: auto;" alt="Line visualisation">
    </div>
</div>


[source] https://data.humdata.org/dataset/wfp-food-prices  
[general description] https://docs.wfp.org/api/documents/WFP-0000040024/download/  
[in-depth description] http://mvam.org/2018/11/20/getting-up-to-speed-wfp-food-data-on-hdx/  
[quick exploration] https://dataviz.vam.wfp.org  
[UN exchange rates] https://treasury.un.org/operationalrates/OperationalRates.php  


## Research questions

1. Which of the following factors has the greatest influence on global food prices?
    - Temperature
    - Fuel price
    - Daily wage  


2. Can events be inferred from food prices?



## Bootstrap

In [105]:
import pandas as pd
import numpy as np
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
from plotly import tools
from sklearn import preprocessing
from operator import itemgetter, attrgetter
from functools import partial
import datetime
import math
import seaborn as sns

from ipywidgets import Layout, Button, Box, HBox, VBox, widgets, interact, interact_manual, interactive, interactive_output


init_notebook_mode(connected=True)

pd.options.display.max_rows = 100
pd.options.display.max_seq_items = 100

# Food

## Loading


In [106]:
food_df = pd.read_csv("./food_data/food.csv", low_memory=False)
display(food_df.head())

Unnamed: 0,adm0_id,adm0_name,adm1_id,adm1_name,mkt_id,mkt_name,cm_id,cm_name,cur_id,cur_name,pt_id,pt_name,um_id,um_name,mp_month,mp_year,mp_price,mp_commoditysource
0,1.0,Afghanistan,272,Badakhshan,266,Fayzabad,55,Bread - Retail,0.0,AFN,15,Retail,5,KG,1,2014,50.0,
1,1.0,Afghanistan,272,Badakhshan,266,Fayzabad,55,Bread - Retail,0.0,AFN,15,Retail,5,KG,2,2014,50.0,
2,1.0,Afghanistan,272,Badakhshan,266,Fayzabad,55,Bread - Retail,0.0,AFN,15,Retail,5,KG,3,2014,50.0,
3,1.0,Afghanistan,272,Badakhshan,266,Fayzabad,55,Bread - Retail,0.0,AFN,15,Retail,5,KG,4,2014,50.0,
4,1.0,Afghanistan,272,Badakhshan,266,Fayzabad,55,Bread - Retail,0.0,AFN,15,Retail,5,KG,5,2014,50.0,


## Filter

[TODO] Filter out values in the food data that distort the analysis

# Transform

This time I tried to keep the global namespace a bit cleaner: each chart now does its own transformations. Global transformations are added here as columns on food_df. I am still unsure whether food_diff_df really belongs here.

[TODO] Normalization against a reference point
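One possible reading of the reference-point TODO is to express every price relative to the same commodity's mean price in a chosen base year. The sketch below assumes that reading; the toy numbers, the `ref_year` choice and the `mp_price_idx` column name are all made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy data (invented): two commodities priced on very different scales.
prices = pd.DataFrame({
    "cm_name": ["Rice", "Rice", "Rice", "Wheat", "Wheat", "Wheat"],
    "mp_year": [2010, 2011, 2012, 2010, 2011, 2012],
    "mp_price": [50.0, 55.0, 60.0, 200.0, 180.0, 220.0],
})

# Assumption: the reference point is the mean price per commodity in one base year.
ref_year = 2010
ref = (prices[prices["mp_year"] == ref_year]
       .groupby("cm_name")["mp_price"]
       .mean())

# Divide every price by the reference price of its own commodity.
prices["mp_price_idx"] = prices["mp_price"] / prices["cm_name"].map(ref)
print(prices)
```

Unlike min-max scaling, this keeps relative changes comparable between groups (1.1 means 10% above the base year), but groups without data in the base year end up as NaN.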

In [201]:
# add a column with a properly formatted date
food_df["mp_date"] = pd.to_datetime(food_df[["mp_year", "mp_month"]].assign(day=1).rename(columns={"mp_year": "year", "mp_month": "month"}))

# add a column that groups commodities (selected by the first word)
food_df["cm_name_grouped"] = food_df["cm_name"].str.extract(r'^([\w\-]+)', expand=False)

# min-max normalization
def minmax_normalize_group(key, group_df):
    """min-max normalization of one column of a dataframe"""
    # bootstrap a new normalizer   
    min_max_scaler = preprocessing.MinMaxScaler()
    
    # reshape the series to array so that the normalizer accepts it
    array_to_normalize = group_df[key].values.reshape(-1, 1)
    
    # do the actual normalisation     
    x_scaled = min_max_scaler.fit_transform(array_to_normalize)
    
    # undo some weird numpy nesting of arrays    
    x_scaled = np.concatenate(x_scaled).ravel()
    
    # concatenate it to the group_df    
    group_df[f"{key}_norm"] = x_scaled
    
    # return the group
    return group_df

# run the normalization per country on individual goods
food_df = food_df.groupby(by=["adm0_name", "cm_name"]).apply(partial(minmax_normalize_group, "mp_price"))
display(food_df.head())

# compute the mean per year and the year-over-year difference per commodity
food_diff_df = food_df.groupby(by=["adm0_name", "cm_name_grouped", "mp_year"])["mp_price_norm"].mean().unstack(level=2).diff(axis=1)
display(food_diff_df.head())

# quick lists
best_commodities = food_df.groupby(by=["adm0_name", "cm_name_grouped", "mp_year"]).any().groupby(level=["cm_name_grouped"]).size().sort_values(ascending=False)[:10].index
best_countries = food_df.groupby(by=["adm0_name"]).size().sort_values(ascending=False)[:10].index
display(best_commodities, best_countries)

# utility to make selecting with a cross-section easier
def select_commodity(commodity, year):
    return food_diff_df.xs(commodity, level="cm_name_grouped", drop_level=True)[year]


Unnamed: 0,adm0_id,adm0_name,adm1_id,adm1_name,mkt_id,mkt_name,cm_id,cm_name,cur_id,cur_name,...,pt_name,um_id,um_name,mp_month,mp_year,mp_price,mp_commoditysource,mp_date,cm_name_grouped,mp_price_norm
0,1.0,Afghanistan,272,Badakhshan,266,Fayzabad,55,Bread - Retail,0.0,AFN,...,Retail,5,KG,1,2014,50.0,,2014-01-01,Bread,0.608974
1,1.0,Afghanistan,272,Badakhshan,266,Fayzabad,55,Bread - Retail,0.0,AFN,...,Retail,5,KG,2,2014,50.0,,2014-02-01,Bread,0.608974
2,1.0,Afghanistan,272,Badakhshan,266,Fayzabad,55,Bread - Retail,0.0,AFN,...,Retail,5,KG,3,2014,50.0,,2014-03-01,Bread,0.608974
3,1.0,Afghanistan,272,Badakhshan,266,Fayzabad,55,Bread - Retail,0.0,AFN,...,Retail,5,KG,4,2014,50.0,,2014-04-01,Bread,0.608974
4,1.0,Afghanistan,272,Badakhshan,266,Fayzabad,55,Bread - Retail,0.0,AFN,...,Retail,5,KG,5,2014,50.0,,2014-05-01,Bread,0.608974


Unnamed: 0_level_0,mp_year,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
adm0_name,cm_name_grouped,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
Afghanistan,Bread,,,,,,,,,,,...,,,,,,0.055021,-0.007271,-0.006384,-0.004228,0.0009
Afghanistan,Exchange,,,,,,,,,,,...,,,,,,0.178623,0.301038,0.014753,0.200153,0.154888
Afghanistan,Fuel,,,,,,,,,,-0.007045,...,,,,,,-9.3e-05,-4.6e-05,4.4e-05,3.5e-05,1.4e-05
Afghanistan,Livestock,,,,,,,,,,0.135611,...,,,,,,,,,,
Afghanistan,Rice,,,,,,,,,,,...,-0.041588,0.015685,0.072097,0.123801,-0.054074,-0.047844,0.025185,0.038895,0.025792,-0.00298


Index(['Rice', 'Maize', 'Oil', 'Wheat', 'Beans', 'Sugar', 'Sorghum', 'Meat',
       'Millet', 'Fuel'],
      dtype='object', name='cm_name_grouped')

Index(['Rwanda', 'Bassas da India', 'Mali', 'Niger', 'Syrian Arab Republic',
       'Democratic Republic of the Congo', 'Kyrgyzstan', 'Burundi', 'Zambia',
       'Lebanon'],
      dtype='object', name='adm0_name')
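To make the per-group min-max normalization easier to follow, here is a minimal sketch on a toy frame. It uses `groupby(...).transform` instead of `MinMaxScaler` and `apply`, which is a swapped-in technique rather than what the cell above does; the data and the `minmax` helper are invented for illustration.

```python
import pandas as pd

# Toy data (invented): the same commodity in two countries with different price levels.
toy = pd.DataFrame({
    "adm0_name": ["A", "A", "A", "B", "B", "B"],
    "cm_name": ["Rice"] * 6,
    "mp_price": [10.0, 20.0, 30.0, 100.0, 150.0, 200.0],
})

def minmax(series):
    """Scale a series to [0, 1]; a constant series maps to 0.0 (assumption of this sketch)."""
    span = series.max() - series.min()
    if span == 0:
        return series * 0.0
    return (series - series.min()) / span

# transform keeps the original row order and index, so the result can be assigned directly.
toy["mp_price_norm"] = (
    toy.groupby(["adm0_name", "cm_name"])["mp_price"].transform(minmax)
)
print(toy)
```

Both countries end up on the same 0 to 1 scale, which is what makes the yearly means comparable across currencies.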

## Exploration

### [Barchart] Amount of data

In [108]:
def barchart_render(data, title):
    """render a horizontal barchart with some configuration"""
    barchart_trace = go.Bar(y=data.index, x=data.values, orientation='h')
    layout = go.Layout(yaxis=dict(automargin=True),
                       height=200 + 20 * len(data.index),
                       title=title)

    iplot(go.Figure(data=[barchart_trace], layout=layout))


def barchart_data(value="total_records", grouping=["adm0_name"], length=25):
    """switch between different groupings and values"""
    # group by the user selection
    groups = food_df.groupby(by=[*grouping, "mp_year"])

    # check the value to determine the metric
    if value == "total_records":
        data = groups.size()

    if value == "filled_years":
        data = groups["mp_price"].any()

    # regroup by the last level, sum and sort
    data = data.groupby(level=grouping[-1:]).sum().sort_values()
    title = f"Total {grouping} by {value}: {data.sum()} "
    barchart_render(data[-length:], title)


def barchart_ui():
    """separates the UI from the data and rendering"""
    value_toggle = widgets.ToggleButtons(
        options=[("Total Records",
                  "total_records"), ("Filled years", "filled_years")])

    grouping_toggle = widgets.ToggleButtons(
        options=[("Original Commodities", ["adm0_name", "cm_name"]),
                 ("Grouped Commodities", ["adm0_name", "cm_name_grouped"]), 
                 ("Country", ["adm0_name"])])

    length_slider = widgets.IntSlider(min=1,
                                      max=51,
                                      step=1,
                                      continuous_update=False,
                                      value=25)

    interact(barchart_data,
             value=value_toggle,
             grouping=grouping_toggle,
             length=length_slider)


barchart_ui()

interactive(children=(ToggleButtons(description='value', options=(('Total Records', 'total_records'), ('Filled…

### [Heatmap] Location of the data

[TODO] Has a weird bug on first render which doesn't show chart

In [312]:
def heatmap_render(data, trace_layout, title):
    """render the output"""
    heatmap_trace = go.Heatmap(z=data.values,
                               x=data.columns,
                               y=data.index,
                               **trace_layout)
    
    axis = dict(tickfont=dict(size=10), showticklabels=True, automargin=True)
    layout = go.Layout(autosize=False,
                       width=800,
                       height=200 + 12 * len(data.index),
                       xaxis={"range": [1990, 2019]},
                       yaxis=axis,
                       title=title)

    fig = go.Figure(data=[heatmap_trace], layout=layout)
    iplot(fig)


def heatmap_data(subject, value, commodity, country):
    """transform the food_df to something usable"""
    groups = food_df.groupby(by=["adm0_name", "cm_name_grouped", "mp_year"])

    # determine the metric
    if value == "count":
        metric = groups.size()
        trace_layout = {"zmin": 0, "zmax": 200, "colorscale": "Viridis"}

    elif value == "mp_price_norm":
        metric = groups["mp_price_norm"].mean()
        trace_layout = {"zmin": 0, "zmax": 1, "colorscale": "Viridis"}

    elif value == "mp_price_norm_diff":
        metric = groups["mp_price_norm"].mean().groupby(level=["adm0_name", "cm_name_grouped"]).diff()
        trace_layout = {
            "zmin": -0.4,
            "zmax": 0.4,
            "colorscale": [[1, 'rgb(239,138,98)'], [0.5, 'rgb(255,255,255)'],
                           [0, 'rgb(103,169,207)']]
        }

    # determine the level on which the cross-section needs to be taken
    def take_xs(subject, metric):
        if subject == "commodity":
            return metric.xs(commodity, level="cm_name_grouped")

        elif subject == "country":
            return metric.xs(country, level="adm0_name")

    # take the cross-section, unstack, sort and add missing years
    data = take_xs(subject, metric).unstack(level="mp_year")
    data = data.reindex(sorted(data.columns), axis=1)

    # construct a count, title and render
    record_count = take_xs(subject, groups.size()).sum()
    title = f"{commodity if subject == 'commodity' else country} ({record_count} records)"
    heatmap_render(data, trace_layout, title)


def heatmap_ui():
    """separates the quite complex heatmap UI from the transformations and rendering"""
    # construct a dict of togglable widgets so hiding becomes easier
    togglable_widgets = {
        "country": widgets.Dropdown(options=[*food_df["adm0_name"].unique()]),
        "commodity": widgets.Dropdown(options=[*best_commodities])
    }

    # toggle buttons
    subject_widget = widgets.ToggleButtons(
        options=["commodity", "country"],
        description='Subject',
        value="commodity",
    )

    value_widget = widgets.ToggleButtons(
        options=["count", "mp_price_norm", "mp_price_norm_diff"],
        description='Value',
        value="count",
    )

    # setup ui layout
    ui = VBox([
        HBox([
            subject_widget, togglable_widgets["country"],
            togglable_widgets["commodity"]
        ]), value_widget
    ])

    def widget_toggle_func(**kwargs):
        """helper to toggle visibility of widgets, uses a closure to get access to the togglable widgets"""
        for widget in togglable_widgets.values():
            widget.layout.display = "none"
        togglable_widgets[kwargs["subject"]].layout.display = "flex"
        # render the actual heatmap
        heatmap_data(**kwargs)

    # interactive argument widget bindings
    out = interactive_output(
        widget_toggle_func, {
            "subject": subject_widget,
            "value": value_widget,
            "commodity": togglable_widgets["commodity"],
            "country": togglable_widgets["country"]
        })

    # display the ui and bind the output
    display(ui, out)


heatmap_ui()

VBox(children=(HBox(children=(ToggleButtons(description='Subject', options=('commodity', 'country'), value='co…

Output()
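The heatmap relies on taking a cross-section of a three-level index and then unstacking the years into columns. A minimal sketch of that step on invented data (the index names match the notebook, the values do not come from the dataset):

```python
import pandas as pd

# Toy metric (invented) indexed like the grouped food data.
index = pd.MultiIndex.from_tuples(
    [("Mali", "Rice", 2017), ("Mali", "Rice", 2018),
     ("Niger", "Rice", 2018), ("Niger", "Maize", 2018)],
    names=["adm0_name", "cm_name_grouped", "mp_year"],
)
metric = pd.Series([0.2, 0.4, 0.5, 0.9], index=index)

# xs drops the selected level; unstack turns the years into columns.
rice = metric.xs("Rice", level="cm_name_grouped").unstack(level="mp_year")
print(rice)
```

Country/year combinations without records show up as NaN, which the heatmap renders as empty cells.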

### Line / Histo / Descriptives UI

In [110]:
def linehisto_ui(func):
    # toggle buttons
    value_widget = widgets.ToggleButtons(
        options=["mp_price", "mp_price_norm"],
        description='Value',
        value="mp_price_norm",
    )

    interact(func,
             value=value_widget,
             country=[*food_df["adm0_name"].unique(), "World"],
             commodity=[*best_commodities, "All"])

### [Line] Means, medians and spread

In [305]:
def make_line_trace(data, name):
    return go.Scatter(x=data.index, y=data.values, name=name)


def make_scatter_trace(data):
    return go.Scattergl(x=data.index,
                        y=data,
                        mode='markers',
                        name="Measurements",
                        marker=dict(color="#2077b4",
                                    size=2,
                                    line=dict(width=0)))


def line_render(data_scatter, data_mean, data_median, data_world_mean, title):
    """plot a line"""
    trace_scatter = make_scatter_trace(data_scatter)
    trace_mean = make_line_trace(data_mean, "Mean")
    trace_median = make_line_trace(data_median, "Median")
    trace_world_mean = make_line_trace(data_world_mean, "World Mean")

    layout = go.Layout(title=title)
    fig_scatter = go.Figure(
        data=[trace_scatter, trace_mean, trace_median, trace_world_mean],
        layout=layout)

    iplot(fig_scatter)


def line_data(value, country, commodity):
    """separate the transformations from UI and rendering"""
    # scatter
    data = food_df if country == "World" else food_df[food_df["adm0_name"] == country]
    data = data if commodity == "All" else data[data["cm_name_grouped"] == commodity]

    # scatter
    data_scatter = data.set_index(data["mp_date"])[value]

    # mean
    data_mean = data.groupby(by=[data.mp_date.dt.year])[value].mean()

    # median
    data_median = data.groupby(by=[data.mp_date.dt.year])[value].median()

    # naive world mean (only makes sense if normalized)
    data_world_mean = food_df[
        food_df["mp_year"] >= data["mp_year"].min()].groupby(
            by="mp_year")[value].mean(
            ) if value == "mp_price_norm" else pd.Series(dtype=float)

    # render the line
    title = f"{country}: {commodity}"
    line_render(data_scatter=data_scatter,
                data_mean=data_mean,
                data_median=data_median,
                data_world_mean=data_world_mean,
                title=title)

linehisto_ui(line_data)

interactive(children=(ToggleButtons(description='Value', index=1, options=('mp_price', 'mp_price_norm'), value…

### [Histogram] Distributions

In [306]:
def make_histo_trace(trace_data, xbins, name):
    return go.Histogram(x=trace_data, xbins=xbins, name=name)

def histo_render(data):
    """plot a histogram matrix"""
    # get all possible years
    years = data.index.year.unique().values

    # use the subplot helper to assign axis labels and do part of the math
    rows = math.ceil(len(years) / 3)
    cols = 3
    fig_histo = tools.make_subplots(rows=rows,
                                    cols=cols,
                                    print_grid=False, 
                                    subplot_titles=years.astype('str'),
                                    vertical_spacing=0.3/rows)
    fig_histo.layout.update(showlegend = False)
    fig_histo.layout.update(height = 150 * rows)

    # properly set the bins
    xbins = dict(
        start=0,
        end=data.max(),
        size=data.max() / 100,
    )

    # construct all individual traces
    for i in range(len(years)):
        trace_data = data[data.index.year == years[i]]
        name = f"{years[i]}"
        histo_trace = make_histo_trace(trace_data, xbins, name)
        fig_histo.append_trace(histo_trace, math.ceil((i + 1) / 3), i % 3 + 1)

    iplot(fig_histo)

    
def histo_data(value, country, commodity):
    """separate the transformations from UI and rendering"""
    data = food_df if country == "World" else food_df[food_df["adm0_name"] == country]
    data = data if commodity == "All" else data[data["cm_name_grouped"] == commodity]

    # render the histogram matrix
    histo_render(data.set_index(data["mp_date"])[value])   

linehisto_ui(histo_data)

interactive(children=(ToggleButtons(description='Value', index=1, options=('mp_price', 'mp_price_norm'), value…

### [Table] Descriptives

In [307]:
def descriptives_data(value, country, commodity):
    """calculate complete descriptives"""
    data = food_df if country == "World" else food_df[food_df["adm0_name"] == country]
    data = data if commodity == "All" else data[data["cm_name_grouped"] == commodity]
    
    # per year     
    descriptives = data.groupby(by=["mp_year"])[value].describe()
    descriptives["skew"] = data.groupby(by=["mp_year"])[value].skew()
    descriptives["kurtosis"] = data.groupby(by=["mp_year"])[value].apply(pd.Series.kurt)
    
    # totals    
    totals = data[value].describe().rename("Total")
    totals["skew"] = data[value].skew()
    totals["kurtosis"] = data[value].kurtosis()
    
    # do the final bits    
    descriptives = descriptives.append(totals)
    descriptives["cv"] = descriptives["std"] / descriptives["mean"]
    descriptives = descriptives[['count', 'mean', 'std', 'cv', 'min', '25%', '50%', '75%', 'max', 'skew','kurtosis']]
    
    # style it
    descriptives = descriptives.style.background_gradient(cmap="viridis")
    
    # render them
    display(descriptives)

linehisto_ui(descriptives_data)

interactive(children=(ToggleButtons(description='Value', index=1, options=('mp_price', 'mp_price_norm'), value…
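The extra columns in the descriptives table can be checked on a small example. The sketch below uses invented prices and the same pandas calls as the cell above; the coefficient of variation is simply `std / mean`.

```python
import pandas as pd

# Toy prices (invented): 2017 is symmetric, 2018 has one large outlier.
toy = pd.DataFrame({
    "mp_year": [2017] * 4 + [2018] * 4,
    "mp_price": [1.0, 2.0, 3.0, 4.0, 2.0, 2.0, 2.0, 10.0],
})

grouped = toy.groupby("mp_year")["mp_price"]
stats = grouped.describe()
stats["skew"] = grouped.skew()
stats["kurtosis"] = grouped.apply(pd.Series.kurt)
stats["cv"] = stats["std"] / stats["mean"]  # coefficient of variation
print(stats[["mean", "std", "cv", "skew", "kurtosis"]])
```

The symmetric year has a skew of zero, while the year with the outlier has a positive skew and a much higher coefficient of variation.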

### [Line + Histo + Descript combination]

[TODO] get rid of the (None, None) created by the lambda. Somehow this seems to be harder to remove than it should be.

In [114]:
# linehisto_ui(lambda **kwargs: (line_data(**kwargs), histo_data(**kwargs)))
# linehisto_ui(lambda **kwargs: (line_data(**kwargs), histo_data(**kwargs), descriptives_data(**kwargs)))
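A possible way around the `(None, None)` output: the lambda returns a tuple, and `interact` displays any return value that is not `None`. A small wrapper that calls each function and returns nothing avoids that. `combine` is a hypothetical helper, not part of the notebook, and the `fake_*` functions only stand in for `line_data` and `histo_data`.

```python
def combine(*funcs):
    """Return a function that calls every func with the same keyword arguments."""
    def combined(**kwargs):
        for func in funcs:
            func(**kwargs)
        # no return statement: the result is None, so nothing extra is displayed
    return combined


# Stand-ins for line_data / histo_data that only record that they were called.
calls = []

def fake_line(value, country, commodity):
    calls.append(("line", country, commodity))

def fake_histo(value, country, commodity):
    calls.append(("histo", country, commodity))

both = combine(fake_line, fake_histo)
result = both(value="mp_price_norm", country="Mali", commodity="Rice")
print(result, calls)
```

In the notebook this would presumably be used as `linehisto_ui(combine(line_data, histo_data, descriptives_data))`.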

### [SPLOM] Correlation between countries and commodities

[TODO] not finished

In [318]:
def corr_matrix_render(data, treshhold):
    corr_matrix = data.unstack(level="adm0_name").corr().unstack()
#     display(data.unstack(level="adm0_name").corr())
    corr_matrix_ss = corr_matrix[(corr_matrix > treshhold) | (corr_matrix < -treshhold)].unstack()

    trace = go.Heatmap(x=corr_matrix_ss.index,
                       y=corr_matrix_ss.index,
                       z=corr_matrix_ss,
                       zmin=-1,
                       zmax=1,
                       showscale=False)

    axis = dict(tickfont=dict(size=7), showticklabels=True, automargin=True)

    layout = go.Layout(autosize=True,
                       width=200 + 11 * len(corr_matrix_ss.index),
                       height=200 + 11 * len(corr_matrix_ss.index),
                       yaxis=axis,
                       xaxis=axis,
                       title="Correlation")

    corr_fig = go.Figure(data=[trace], layout=layout)

    iplot(corr_fig)


def splom_render(data):
    trace1 = go.Splom(dimensions=data)
    layout = go.Layout(title='Commodity correlations', height=1000)
    fig = go.Figure(data=[trace1], layout=layout)
    iplot(fig)


def correlation_data(country, commodity, treshhold=0.8):
    commodity = food_df.groupby(by=["adm0_name", "cm_name_grouped", "mp_year"])["mp_price_norm"].mean().xs("Rice", level="cm_name_grouped")
    sliced = commodity[commodity.index.get_level_values(level="adm0_name").isin(best_countries[:5])]
    groups = sliced.groupby(level="adm0_name").apply(lambda df: df.values)
    data = [{
        "label": label,
        "values": values
    } for label, values in groups.items()]
    #     splom_render(data)
    corr_matrix_render(commodity, treshhold)


interact(correlation_data,
         treshhold=(0, 1, 0.1),
         country=food_df["adm0_name"].unique(),
         commodity=best_commodities)

interactive(children=(Dropdown(description='country', options=('Afghanistan', 'Algeria', 'Angola', 'Armenia', …

<function __main__.correlation_data(country, commodity, treshhold=0.8)>
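The thresholding idea in `corr_matrix_render` can be sketched on toy data: put countries in columns, correlate them, and keep only the coefficients whose absolute value exceeds the threshold. The values are invented, and `DataFrame.where` is used here instead of the stack/unstack round trip in the cell above.

```python
import pandas as pd

# Toy yearly means (invented) for three countries.
index = pd.MultiIndex.from_product(
    [["Mali", "Niger", "Zambia"], [2015, 2016, 2017, 2018]],
    names=["adm0_name", "mp_year"],
)
values = [0.1, 0.2, 0.3, 0.4,   # Mali
          0.2, 0.4, 0.6, 0.8,   # Niger: moves exactly with Mali
          0.4, 0.1, 0.3, 0.2]   # Zambia: no clear relation
yearly = pd.Series(values, index=index)

# Countries become columns, years become rows, then correlate column pairs.
corr = yearly.unstack(level="adm0_name").corr()

# Keep only strong positive or negative correlations; the rest becomes NaN.
threshold = 0.8
strong = corr.where(corr.abs() > threshold)
print(strong)
```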

## Visualization: Map

Okay, a map with interactivity. I actually found the IPython docs nicer.

[interactive plots] https://towardsdatascience.com/interactive-controls-for-jupyter-notebooks-f5c94829aee6  
[IPython widget docs] https://ipywidgets.readthedocs.io/en/stable/examples/Using%20Interact.html

### Food change maps

In [115]:
def map_graph(commodity=best_commodities, year=(2007, 2019, 1)):
    # Plot the mean price, of all commodities on map, of all years.
    series = select_commodity(commodity, year)
    map_data = [go.Choropleth(
        locations = series.index,
        locationmode = "country names",
        z = series.values,
    )]

    fig = go.Figure(data = map_data)
    iplot(fig)
    
interact(map_graph, commodity=best_commodities, year=(1996, 2019, 1))

# interact(map_climate, year=(1996, 2013, 1))


# def slider(year):
#     map_graph("Fuel", year)
# #     map_climate(year)
    
# interact(slider, year=(2007, 2013, 1))




interactive(children=(Dropdown(description='commodity', options=('Rice', 'Maize', 'Oil', 'Wheat', 'Beans', 'Su…

<function __main__.map_graph(commodity=Index(['Rice', 'Maize', 'Oil', 'Wheat', 'Beans', 'Sugar', 'Sorghum', 'Meat',
       'Millet', 'Fuel'],
      dtype='object', name='cm_name_grouped'), year=(2007, 2019, 1))>

## Validation

To check whether the dataset is correct we of course need to run a test. I used global food prices as a benchmark. There is probably a strong correlation there, especially for regions that are not directly in conflict.

[imf commodity prices data] https://www.imf.org/en/Research/commodity-prices  
[interesting food analysis] https://ourworldindata.org/food-prices  
[interesting statistical analysis of commodities] https://www.imf.org/~/media/Files/Research/CommodityPrices/WEOSpecialFeature/SFApril2019.ashx  
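The benchmark idea boils down to: normalize both series, take year-over-year differences, and correlate them. A minimal sketch with invented numbers (these are not WFP or IMF values; the `*_like` names and the `minmax` helper are hypothetical):

```python
import pandas as pd

def minmax(series):
    """Scale a series to the range [0, 1]."""
    return (series - series.min()) / (series.max() - series.min())

# Invented yearly prices with a shared spike, on different scales.
years = range(2005, 2011)
wfp_like = pd.Series([100, 110, 130, 190, 150, 160], index=years, dtype=float)
imf_like = pd.Series([300, 320, 380, 600, 480, 500], index=years, dtype=float)

# Normalize, then compare the yearly changes instead of the levels.
wfp_diff = minmax(wfp_like).diff()
imf_diff = minmax(imf_like).diff()

# Series.corr ignores the NaN that diff() leaves in the first year.
r = wfp_diff.corr(imf_diff)
print(round(r, 3))
```

Correlating the differences rather than the raw levels avoids a high coefficient that only reflects a shared upward trend.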


In [97]:
from sklearn import datasets, linear_model
from scipy import stats

# load the IMF commodity prices and the yearly mean temperature difference
commodity_imf_df = pd.read_csv("./commodity_imf_proper.csv",
                               dtype=float,
                               decimal=",")
commodity_imf_df = commodity_imf_df.set_index('Year')
climate_series = climate_tran_df.groupby(
    level=1).mean()["AverageTemperature_diff"].loc[1994:2013]


def normalize_series(series):
    # bootstrap a new normalizer
    min_max_scaler = preprocessing.MinMaxScaler()

    # reshape the series to array so that the normalizer accepts it
    to_normalize = series.values.reshape(-1, 1)

    # do the actual normalisation
    x_scaled = min_max_scaler.fit_transform(to_normalize)

    # undo some weird numpy nesting of arrays
    x_scaled = np.concatenate(x_scaled).ravel()

    # return the series
    return pd.Series(x_scaled, index=series.index)


def line_graph(food):
    # create an average of all countries
    test = grouped_commodity_diff_df.xs(food, level=1, drop_level=False).mean()
    fuel = grouped_commodity_diff_df.xs("Fuel", level=1,
                                        drop_level=False).mean()

    # use the IMF data to compute a normalized and diffed line too
    control = normalize_series(commodity_imf_df[food]).diff()

    test_trace = go.Scatter(name="WFP", x=test.index, y=test.values)

    control_trace = go.Scatter(name="IMF",
                               x=control.index[-25:],
                               y=control.values[-25:])
    fuel_trace = go.Scatter(name="Fuel", x=fuel.index, y=fuel.values)

    climate_trace = go.Scatter(name="Climate",
                               x=climate_series.index,
                               y=climate_series.values)

    #     display(test)
    #     display(climate_series)
    #     display(fuel)
    
    control.name = "IMF"
    test.name = "Rice"
    climate_series.name = "Klimaat"
    fuel.name = "Fuel"
    
    scatter_data = pd.DataFrame([control, test, climate_series, fuel]).T
#     display(scatter_data.corr())

    #     iplot([test_trace, control_trace, fuel_trace, climate_trace], filename='totalchart')
    
    xi = scatter_data['Rice'].iloc[-25:-1].values
    y = scatter_data['IMF'].iloc[-25:-1].values
    slope, intercept, r_value, p_value, std_err = stats.linregress(xi,y)
    line = slope*xi+intercept
    display(line)
    
    
#     display(regr.predict(scatter_data['Rice'].iloc[-25:-1].values.reshape(-1, 1), scatter_data['IMF'].iloc[-25:-1].values.reshape(-1, 1)))
    
    trace1 = go.Splom(dimensions=[
        dict(label='Rice', values=scatter_data['Rice']),
        dict(label='IMF', values=scatter_data['IMF']),
        dict(label='Klimaat', values=scatter_data['Klimaat']),
        dict(label='Fuel', values=scatter_data['Fuel'])
    ])

    axis = dict(showline=True, zeroline=False, gridcolor='#fff', ticklen=4)

    layout = go.Layout(title='FOOD',
                       dragmode='select',
                       width=600,
                       height=600,
                       autosize=False,
                       hovermode='closest',
                       plot_bgcolor='rgba(240,240,240, 0.95)',
                       xaxis1=dict(axis),
                       xaxis2=dict(axis),
                       xaxis3=dict(axis),
                       xaxis4=dict(axis),
                       yaxis1=dict(axis),
                       yaxis2=dict(axis),
                       yaxis3=dict(axis),
                       yaxis4=dict(axis))

    fig1 = dict(data=[trace1], layout=layout)
    iplot(fig1, filename='splom-iris1')


interact(line_graph, food=["Rice", "Wheat", "Sugar"])


NameError: name 'climate_tran_df' is not defined

## Result

There does seem to be some correlation, but there are also odd differences between the two. In particular, the peaks are much higher. What is clear is that 2008 had a large impact. I think we need to take another careful look at how we normalize. It seems quite possible to tune this, but I am unsure which choices are justifiable.

# Climate

Now let's see whether it is related to climate. I have a feeling that finding any correlation here is going to be difficult.

## Load data

In [None]:

climate_df = pd.read_csv("./climate_data/GlobalLandTemperaturesByCountry.csv", low_memory=False)
display(climate_df.head())


## Type and normalize

Here we go again.

1. Type the datetime
2. Normalize
3. Create year averages
4. Diff those averages to have something comparable

This data is a bit different, though. It is cumulative, so it's not the same as money, and the effects play out over a much longer term. Treating it exactly like prices doesn't quite make sense.
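The four steps above can be sketched end to end on a tiny, invented temperature table. The column names follow the climate CSV used in this notebook; the values are made up, and the normalization is written out by hand instead of using the notebook's helper.

```python
import pandas as pd

# Toy temperatures (invented): two measurements per year for one country.
climate = pd.DataFrame({
    "dt": ["2011-01-01", "2011-07-01", "2012-01-01", "2012-07-01",
           "2013-01-01", "2013-07-01"],
    "Country": ["Mali"] * 6,
    "AverageTemperature": [20.0, 30.0, 22.0, 32.0, 21.0, 31.0],
})

# 1. type the datetime
climate["dt"] = pd.to_datetime(climate["dt"])

# 2. min-max normalize per country
temps = climate.groupby("Country")["AverageTemperature"]
low, high = temps.transform("min"), temps.transform("max")
climate["AverageTemperature_norm"] = (climate["AverageTemperature"] - low) / (high - low)

# 3. yearly averages per country
yearly = (climate
          .groupby(["Country", climate["dt"].dt.year])["AverageTemperature_norm"]
          .mean())

# 4. year-over-year difference within each country
yearly_diff = yearly.groupby(level="Country").diff()
print(yearly_diff)
```

The first year of every country has no predecessor, so its difference is NaN.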

In [None]:
# type properly
climate_df_typed = climate_df.copy()
climate_df_typed['dt'] = climate_df_typed['dt'].astype('datetime64[ns]')

# apply the normalize function (with a partial for the key)
climate_norm_df = climate_df_typed.groupby(by=["Country"]).apply(partial(minmax_normalize_group, "AverageTemperature"))

## Transform and select

In [None]:
# calculate the mean for the year
climate_mean_df = climate_norm_df.groupby(by=["Country", climate_norm_df["dt"].dt.year]).mean()
climate_mean_df.reset_index(inplace=True)

# quick function with assignment
def diff_temp(group_df):
    group_df["AverageTemperature_diff"] = group_df["AverageTemperature_norm"].diff()
    return group_df

# apply the difference function
climate_tran_df = climate_mean_df.groupby(by=["Country"]).apply(diff_temp).set_index(["Country", "dt"])

# quick selection function
def select_year(year):
    return climate_tran_df.xs(year, level=1, drop_level=False)

## Map: Temperature difference from the previous year

Self-explanatory, but whether it is interesting is another question.

In [None]:
def map_climate(year):
    year_data = select_year(year)
    map_data = go.Choropleth(locations=year_data.index.get_level_values(0),
                             locationmode="country names",
                             z=year_data["AverageTemperature_diff"].values)
    
    
    
    fig = go.Figure(data=[map_data])

    iplot(fig)

interact(map_climate, year=(1996, 2013, 1))



In [None]:
from ipywidgets import Layout, Button, Box

layout = {
    'geo': {
        'domain': {
            'x': [0.0, 0.33],
            'y': [0, 1.0]
        }
    },
    'geo2': {
        'domain': {
            'x': [0.33, 0.66],
            'y': [0, 1.0]
        }
    },
    'geo3': {
        'domain': {
            'x': [0.66, 0.99],
            'y': [0, 1.0]
        }
    }
}

layout = {
    'geo': {
        'domain': {
            'x': [0.0, 0.5],
            'y': [0, 1.0]
        },
        #         "scope": "africa",
        "showframe": False,
        "lonaxis": {
            "range": [-30, 170]
        },
        "lataxis": {
            "range": [-35, 45]
        }
    },
    'geo2': {
        'domain': {
            'x': [0.5, 1],
            'y': [0, 1.0]
        },
        #         "scope": "africa",
        "showframe": False,
        "lonaxis": {
            "range": [-30, 170]
        },
        "lataxis": {
            "range": [-35, 45]
        }
    }
}


def map_all(year):
    climate_year_data = select_year(year)
    climate_trace = go.Choropleth(
        locations=climate_year_data.index.get_level_values(0),
        locationmode="country names",
        z=climate_year_data["AverageTemperature_diff"].values,
        geo="geo1",
        showscale=False)

    rice_year_data = select_commodity("Rice", year)
    rice_trace = go.Choropleth(locations=rice_year_data.index,
                               locationmode="country names",
                               z=rice_year_data.values,
                               geo="geo2",
                               showscale=False,
                               zmin=-0.5,
                               zmax=0.5)

    climate_trace2 = go.Choropleth(
        locations=climate_year_data.index.get_level_values(0),
        locationmode="country names",
        z=climate_year_data["AverageTemperature_diff"].values,
        geo="geo3",
        showscale=False,
        zmin=-0.5,
        zmax=0.5)

    rice_trace2 = go.Choropleth(locations=rice_year_data.index,
                                locationmode="country names",
                                z=rice_year_data.values,
                                geo="geo4",
                                showscale=False)

    #     fuel_year_data = select_commodity("Fuel", year)
    #     fuel_trace = go.Choropleth(locations=fuel_year_data.index,
    #                                locationmode="country names",
    #                                z=fuel_year_data.values,
    #                                geo="geo3",
    #                                showscale=False)

    fig = go.Figure(data=[climate_trace, rice_trace], layout=layout)
    iplot(fig)


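# `style=` was left empty (a syntax error); 'initial' is an assumed value that lets the label use its natural width
widgets.IntSlider(description='A too long description', style={'description_width': 'initial'})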

interact(map_all, year=(2007, 2013, 1))

In [None]:

items_layout = Layout(width='50%')     # each button takes half of the container width instead of the default

box_layout = Layout(display='flex',
                    flex_flow='column',
                    align_items='stretch',
                    justify_content='center',
                    width='100%')

words = ['correct', 'horse', 'battery', 'staple']
items = [Button(description=word, layout=items_layout) for word in words]
box = Box(children=items, layout=box_layout)
box

# Sparklines Climate

Perhaps an overall picture works better. 

In [None]:
# `plotly.plotly` was imported here but never used (and it moved to `chart_studio` in plotly 4), so it is dropped
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/1962_2006_walmart_store_openings.csv')
display(df.head())

data = []
layout = dict(
    title = 'New Walmart Stores per year 1962-2006<br>\
Source: <a href="http://www.econ.umn.edu/~holmes/data/WalMart/index.html">\
University of Minnesota</a>',
    # showlegend = False,
    autosize = False,
    width = 1000,
    height = 900,
    hovermode = False,
    legend = dict(
        x=0.7,
        y=-0.1,
        bgcolor="rgba(255, 255, 255, 0)",
        font = dict( size=11 ),
    )
)
years = df['YEAR'].unique()

for i in range(len(years)):
    geo_key = 'geo'+str(i+1) if i != 0 else 'geo'
    lons = list(df[ df['YEAR'] == years[i] ]['LON'])
    lats = list(df[ df['YEAR'] == years[i] ]['LAT'])
    # Walmart store data
    data.append(
        dict(
            type = 'scattergeo',
            showlegend=False,
            lon = lons,
            lat = lats,
            geo = geo_key,
            name = str(years[i]),
            marker = dict(
                color = "rgb(0, 0, 255)",
                opacity = 0.5
            )
        )
    )
    # Year markers
    data.append(
        dict(
            type = 'scattergeo',
            showlegend = False,
            lon = [-78],
            lat = [47],
            geo = geo_key,
            text = [years[i]],
            mode = 'text',
        )
    )
    layout[geo_key] = dict(
        scope = 'usa',
        showland = True,
        landcolor = 'rgb(229, 229, 229)',
        showcountries = False,
        domain = dict( x = [], y = [] ),
        subunitcolor = "rgb(255, 255, 255)",
    )


def draw_sparkline( domain, lataxis, lonaxis ):
    ''' Returns a sparkline layout object for geo coordinates  '''
    return dict(
        showland = False,
        showframe = False,
        showcountries = False,
        showcoastlines = False,
        domain = domain,
        lataxis = lataxis,
        lonaxis = lonaxis,
        bgcolor = 'rgba(255,200,200,0.0)'
    )

# Stores per year sparkline
layout['geo44'] = draw_sparkline({'x':[0.6,0.8], 'y':[0,0.15]}, \
                                 {'range':[-5.0, 30.0]}, {'range':[0.0, 40.0]} )
data.append(
    dict(
        type = 'scattergeo',
        mode = 'lines',
        lat = list(df.groupby(by=['YEAR']).count()['storenum']/1e1),
        lon = list(range(len(df.groupby(by=['YEAR']).count()['storenum']/1e1))),
        line = dict( color = "rgb(0, 0, 255)" ),
        name = "New stores per year<br>Peak of 178 stores per year in 1990",
        geo = 'geo44',
    )
)

# Cumulative sum sparkline
layout['geo45'] = draw_sparkline({'x':[0.8,1], 'y':[0,0.15]}, \
                                 {'range':[-5.0, 50.0]}, {'range':[0.0, 50.0]} )
data.append(
    dict(
        type = 'scattergeo',
        mode = 'lines',
        lat = list(df.groupby(by=['YEAR']).count().cumsum()['storenum']/1e2),
        lon = list(range(len(df.groupby(by=['YEAR']).count()['storenum']/1e1))),
        line = dict( color = "rgb(214, 39, 40)" ),
        name ="Cumulative sum<br>3176 stores total in 2006",
        geo = 'geo45',
    )
)

z = 0
COLS = 5
ROWS = 9
for y in reversed(range(ROWS)):
    for x in range(COLS):
        geo_key = 'geo'+str(z+1) if z != 0 else 'geo'
        layout[geo_key]['domain']['x'] = [float(x)/float(COLS), float(x+1)/float(COLS)]
        layout[geo_key]['domain']['y'] = [float(y)/float(ROWS), float(y+1)/float(ROWS)]
        z=z+1
        if z > 42:
            break


layout = {**layout, "height":900, "width":1000 }
# display(data)

fig = { 'data':data, 'layout': layout}
iplot( fig, filename='US Walmart growth' )

# Events

OK, what do I need to do for this? Write queries against Google BigQuery. Cool, let's see how that works.

[GDELT database] https://www.gdeltproject.org/  
[forecasting on food data] https://dataviz.vam.wfp.org/economic_explorer/price-forecasts-alerts  
[reasonable explanation of GDELT] http://data.gdeltproject.org/documentation/GDELT-Event_Codebook-V2.0.pdf  
[explanation of the codes used] https://www.gdeltproject.org/data/documentation/CAMEO.Manual.1.1b3.pdf  

## Download data

There is a Python client for BigQuery. It is actually quite simple.

1. Install the client library
2. Export the credentials file in your shell (I added it to my ~/.bash_profile and sourced it, which works perfectly; macOS/Linux only, unfortunately)
3. Run the sample query
4. Do things with the dataframe

[setup credentials] https://cloud.google.com/bigquery/docs/reference/libraries#client-libraries-install-python  
[database description] https://bigquery.cloud.google.com/table/gdelt-bq:gdeltv2.events?pli=1  
[API reference] https://googleapis.github.io/google-cloud-python/latest/bigquery/index.html
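
Step 2 is the one that tends to go wrong silently, so a quick check before creating the client can help. The sketch below assumes the credentials are exposed through the `GOOGLE_APPLICATION_CREDENTIALS` environment variable, as described in the setup link above; the helper name `credentials_configured` is made up for this notebook.

```python
import os


def credentials_configured(env=None):
    """Return True if the credentials variable points at an existing file."""
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    return bool(path) and os.path.isfile(path)


if not credentials_configured():
    print("GOOGLE_APPLICATION_CREDENTIALS is not set or does not point to a file")
```

If this prints the warning inside the notebook while the shell has the variable set, the notebook server was probably started before the profile was sourced.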

In [None]:
from google.cloud import bigquery
client = bigquery.Client()

# Sample query against the GDELT GKG table; check the table schema before adapting it
query = (
    "SELECT V2Themes FROM `gdelt-bq.gdeltv2.gkg` "
    "WHERE DATE > 20150302000000 AND DATE < 20150304000000 "
    "AND V2Persons LIKE '%Netanyahu%' LIMIT 10;"
)

# submit the query job
query_job = client.query(
    query,
    location="US",
)

# wait for the job to finish and convert the result to a dataframe
result = query_job.result().to_dataframe()

## Query

What should it be able to do?

It should return sentiment and count per country per year or month.
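
A first sketch of such a query, built as a plain string so BigQuery does the aggregation. The column names (`ActionGeo_CountryCode`, `Year`, `MonthYear`, `GoldsteinScale`, `AvgTone`) are taken from the GDELT event codebook as I understand it and should be verified against the table schema; `build_events_query` is a hypothetical helper, not part of any library.

```python
def build_events_query(start_year, end_year, by_month=False):
    """Build an aggregate query: event count and mean sentiment per country per period."""
    # Assumption: the events table has a Year and a MonthYear column
    period = "MonthYear" if by_month else "Year"
    return (
        f"SELECT ActionGeo_CountryCode AS country, {period} AS period, "
        "COUNT(*) AS event_count, "
        "AVG(GoldsteinScale) AS avg_goldstein, "
        "AVG(AvgTone) AS avg_tone "
        "FROM `gdelt-bq.gdeltv2.events` "
        f"WHERE Year BETWEEN {int(start_year)} AND {int(end_year)} "
        "GROUP BY country, period "
        "ORDER BY country, period"
    )


print(build_events_query(2007, 2013))
```

The resulting string can be passed to `client.query(...)` in the same way as the sample query. As far as I can tell the country column holds short country codes rather than names, so matching it to the country names in the food data would be a separate step.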


In [None]:
display(result)

# Todo

## Food
This is our main dataset, so it gets a lot of detailed visualisations.

0. [ x ] Drop messy date columns  

  
1. [ x ] Scatter diagram  
    1.1. [ x ] Transform price to price_norm  
    1.2. [ x ] Histogram  
    1.3. [  ] Histogram grid  
    1.4. [  ] Add skew and kurtosis to descriptives  
    
   
2. [ x ] Matrix   
    2.1. [ x ] Matrix transform, interactive: y-axis from goods to country  
    2.2. [ x ] Matrix transform: z-axis from count to price and price_norm  

  
3. [  ] Grid plot of one commodity per country to get a diachronic perspective  
    3.1. [  ] mp_price/year  
    3.2. [  ] mp_price_norm/year  
    
  
4. [  ] Overall plot  
    4.1. [  ] Supplement the IMF world commodities with a scatter plot  
    4.2. [  ] Build a correlation matrix within a commodity in a country  
    4.3. [  ] Drop poor correlations  
    4.4. [  ] Correlation matrix of the same commodity across countries  
    4.5. [  ] Drop poor correlations  
    4.6. [  ] Correlate with the world commodities  
    

5. [  ] Correlation plots  
    5.1. [  ] Correlation scatter grid  
    5.2. [  ] Correlogram  


6. [  ] Map grid plot  
    6.1. Understand subplots for geographical axes  
    6.2. Find a good design for comparing  

**If time**

7. [  ] Increase the detail in all charts to months  
    7.1. [  ] Matrix  
    7.2. [  ] Line  
    
   
8. [  ] Maps at province or city level  
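
For item 1.4 (skew and kurtosis in the descriptives), a minimal sketch: extend the output of `describe()` with the two shape statistics. The helper name `describe_with_shape` and the toy values are made up for illustration; in the notebook the input would be something like the `mp_price` column.

```python
import pandas as pd


def describe_with_shape(series):
    """Return the usual descriptives plus skew and (excess) kurtosis."""
    stats = series.describe()
    stats["skew"] = series.skew()
    stats["kurtosis"] = series.kurt()
    return stats


# Toy values, not real prices
toy_prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="mp_price")
print(describe_with_shape(toy_prices))
```

Note that pandas reports excess kurtosis, so a normal distribution scores around 0 rather than 3.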




## Climate

### EDA
1. This is the weak spot of the climate data: I do not yet have a good overview of the distributions, or any other insight into the quality of that set. 
2. Some kind of diachronic perspective on this set, if only to check that it is not a nearly straight line from 1990 to 2013. 



## Events

### Setup

#### Checking averages 
1. Draft a good query. This is really the crux. In principle we should let BigQuery do all the work here. The result we want is not all that detailed. 
2. Transform into a comparable dataframe. 
3. Plot as line and map

#### Checking individual events
4. Event detection in the food dataset
5. Compare events against food events
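
Step 4 does not yet say how a food event would be detected. One simple option, shown here purely as an assumption, is to flag periods whose period-over-period price change is an outlier by z-score. The function name, the threshold of 2.0 and the toy series are all hypothetical.

```python
import pandas as pd


def flag_price_events(prices, threshold=2.0):
    """Flag periods whose relative price change is more than `threshold` standard deviations from the mean change."""
    change = prices.pct_change()
    z = (change - change.mean()) / change.std()
    # The first period has no previous value, so its z-score is NaN and it is never flagged
    return z.abs() > threshold


# Toy price series with one sudden jump, not real data
toy = pd.Series([100, 101, 100, 102, 101, 150, 151, 150, 152, 151])
print(toy[flag_price_events(toy)])
```

On real data this would run per country and commodity (for example inside a `groupby`), and the threshold would need tuning, because a single large jump also inflates the standard deviation it is measured against.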

### EDA
6. Everything

### Quality
7. I also still have doubts about what exactly the data describe and what the sources are. The set is downright gigantic and it is an academic project, but it is also extremely complex, with many articles about how it is coded (think Goldstein sentiment and counts... of what exactly?).



## Correlations
1. Try to cluster goods into a single line per country / world
2. Visual EDA on the correlations
3. Examine the correlations of that line inferentially and turn them into a coefficient for the final conclusion.
    - Think about which assumptions applied here again, and which tests therefore have to be satisfied.
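
On the assumptions: Pearson's coefficient measures linear association and is sensitive to outliers, while Spearman's rank coefficient only asks for a monotonic relationship. A quick way to see whether that matters is to compute both, sketched below with `DataFrame.corr`. The column names mirror ones used earlier in the notebook, but the values are toy numbers and `compare_correlations` is a hypothetical helper.

```python
import pandas as pd


def compare_correlations(df):
    """Return the Pearson and Spearman correlation matrices side by side."""
    pearson = df.corr(method="pearson")
    spearman = df.corr(method="spearman")
    return pearson, spearman


# Toy data: a monotonic but non-linear relationship
toy = pd.DataFrame({"AverageTemperature_diff": [1, 2, 3, 4, 5, 6]})
toy["mp_price_norm"] = toy["AverageTemperature_diff"] ** 3

pearson, spearman = compare_correlations(toy)
print(pearson.loc["AverageTemperature_diff", "mp_price_norm"])
print(spearman.loc["AverageTemperature_diff", "mp_price_norm"])
```

A large gap between the two coefficients suggests the relationship is not linear, in which case a Pearson-based conclusion would need a transformation or a rank-based test instead.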


