# Analysis and Visualization of Complex Agro-Environmental Data
---
## Interactive visualization in Python

### 1. Bokeh

`Bokeh` is a Python module for interactive data visualization. Plots are built by stacking layers on top of each other: the first step is to create an empty figure, to which elements are then added in layers. These elements are known as glyphs, and can be anything from lines to bars to circles. Each glyph has properties attached to it, such as color, size and coordinates.
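The layering idea can be sketched with a few toy points. This is a minimal illustration, not part of the original notebook: the data values, title and styling are made up, and `show` is left out so that the snippet also runs outside a notebook.

```python
from bokeh.plotting import figure

# Start from an empty figure, then add glyphs one layer at a time.
p = figure(title="Layered glyphs (toy data)")

# Hypothetical toy coordinates, used only for illustration
x = [1, 2, 3, 4]
y = [2, 4, 3, 5]

# Layer 1: a line glyph with its own color and width properties
p.line(x, y, line_color="steelblue", line_width=2)

# Layer 2: marker glyphs drawn on top of the line
p.scatter(x, y, size=10, fill_color="orange", line_color="black")

# Each glyph call registers one renderer (one layer) on the figure
print(len(p.renderers))
```

Calling `show(p)` after `output_notebook()` would display the figure inline, as done later in this section.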

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


#### 1.1 Data preparation

Download 2 datasets: (1) CO2 emissions per person per year per country and (2) GDP per year per country:

In [2]:
url_co2 = 'https://raw.githubusercontent.com/TrainingByPackt/Interactive-Data-Visualization-with-Python/master/datasets/co2.csv'
co2 = pd.read_csv(url_co2)
url_gm = 'https://raw.githubusercontent.com/TrainingByPackt/Interactive-Data-Visualization-with-Python/master/datasets/gapminder.csv'
gm = pd.read_csv(url_gm)

Transform the dataset - intersect datasets

In [3]:
# Drop duplicates in gm
df_gm = gm[['Country', 'region']].drop_duplicates()
# Combine the 2 datasets (merge by country)
df_w_regions = pd.merge(co2, df_gm, left_on ='country', right_on ='Country', how ='inner') # intersection of both keys, keep order from 'left'
# Drop one Country column
df_w_regions = df_w_regions.drop('Country', axis='columns')
df_w_regions.head()

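|   | country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0.0637 | 0.0854 | 0.154 | 0.242 | 0.294 | 0.412 | 0.35 | 0.316 | 0.299 | South Asia |
| 1 | Albania | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.28 | 1.3 | 1.46 | 1.48 | 1.56 | 1.79 | 1.68 | 1.73 | 1.96 | Europe & Central Asia |
| 2 | Algeria | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 2.99 | 3.19 | 3.16 | 3.42 | 3.3 | 3.29 | 3.46 | 3.51 | 3.72 | Middle East & North Africa |
| 3 | Angola | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.1 | 1.2 | 1.18 | 1.23 | 1.24 | 1.25 | 1.33 | 1.25 | 1.29 | Sub-Saharan Africa |
| 4 | Antigua and Barbuda | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 4.91 | 5.14 | 5.19 | 5.45 | 5.54 | 5.36 | 5.42 | 5.36 | 5.38 | America |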
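To make the effect of the inner merge concrete, here is a small sketch on made-up data. The country codes, values and regions are hypothetical; only the `pd.merge` and `drop` calls mirror the cell above.

```python
import pandas as pd

# Hypothetical toy tables: 'left' plays the role of co2, 'right' of df_gm
left = pd.DataFrame({"country": ["A", "B", "C"], "1800": [0.1, 0.2, 0.3]})
right = pd.DataFrame({"Country": ["B", "C", "D"],
                      "region": ["Europe", "Asia", "Africa"]})

# how='inner' keeps only the keys found in both tables (here B and C)
merged = pd.merge(left, right, left_on="country", right_on="Country", how="inner")

# Both key columns survive the merge, so one of them is dropped
merged = merged.drop("Country", axis="columns")
print(merged)
```

Country `A` (no region) and country `D` (no emissions) both disappear, which is why the merged table can have fewer rows than either input.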


Transform the dataset - stack by year

In [4]:
# change the format of the DataFrame into one that has identifier variables of our choice (here, country and region).
new_co2 = pd.melt(df_w_regions, id_vars=['country', 'region'])
columns = ['country', 'region', 'year', 'co2']
new_co2.columns = columns
# Select data from 1964 onwards, sort by country and then year
df_co2 = new_co2[new_co2['year'].astype('int64') > 1963]
df_co2 = df_co2.sort_values(by=['country', 'year'])
df_co2['year'] = df_co2['year'].astype('int64')
df_co2.head()


|   | country | region | year | co2 |
|---|---|---|---|---|
| 28372 | Afghanistan | South Asia | 1964 | 0.0863 |
| 28545 | Afghanistan | South Asia | 1965 | 0.101 |
| 28718 | Afghanistan | South Asia | 1966 | 0.108 |
| 28891 | Afghanistan | South Asia | 1967 | 0.124 |
| 29064 | Afghanistan | South Asia | 1968 | 0.116 |
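The wide-to-long reshaping can be checked on a tiny table. The sketch below uses made-up values and passes `var_name` and `value_name` to `pd.melt`, an alternative to renaming the columns afterwards as the cell above does.

```python
import pandas as pd

# Hypothetical wide table: one column per year, as in df_w_regions
wide = pd.DataFrame({
    "country": ["A", "B"],
    "region": ["Europe", "Asia"],
    "1964": [1.0, 2.0],
    "1965": [1.5, 2.5],
})

# Stack the year columns into (year, co2) pairs; identifiers are repeated per row
long = pd.melt(wide, id_vars=["country", "region"],
               var_name="year", value_name="co2")

# Year labels come from column names, so they are strings until converted
long["year"] = long["year"].astype("int64")
long = long.sort_values(by=["country", "year"]).reset_index(drop=True)
print(long)
```

Two countries times two year columns give four rows, one per country and year.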


Create similar table for GDP per year per country

In [5]:
df_gdp = gm[['Country', 'Year', 'gdp']]
df_gdp.columns = ['country', 'year', 'gdp']
df_gdp.head()

|   | country | year | gdp |
|---|---|---|---|
| 0 | Afghanistan | 1964 | 1182.0 |
| 1 | Afghanistan | 1965 | 1182.0 |
| 2 | Afghanistan | 1966 | 1168.0 |
| 3 | Afghanistan | 1967 | 1173.0 |
| 4 | Afghanistan | 1968 | 1187.0 |


Merge datasets

In [6]:
data = pd.merge(df_co2, df_gdp, on=['country', 'year'], how='left')
data = data.dropna()
data.head()

|   | country | region | year | co2 | gdp |
|---|---|---|---|---|---|
| 0 | Afghanistan | South Asia | 1964 | 0.0863 | 1182.0 |
| 1 | Afghanistan | South Asia | 1965 | 0.101 | 1182.0 |
| 2 | Afghanistan | South Asia | 1966 | 0.108 | 1168.0 |
| 3 | Afghanistan | South Asia | 1967 | 0.124 | 1173.0 |
| 4 | Afghanistan | South Asia | 1968 | 0.116 | 1187.0 |
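Because a left merge followed by `dropna` silently discards rows without a GDP value, it can be useful to count them first. The sketch below does this on made-up data using the `indicator` option of `pd.merge`; the table names and values are hypothetical.

```python
import pandas as pd

# Hypothetical long tables standing in for df_co2 and df_gdp
co2_long = pd.DataFrame({"country": ["A", "A", "B"],
                         "year": [1964, 1965, 1964],
                         "co2": [1.0, 1.5, 2.0]})
gdp_long = pd.DataFrame({"country": ["A", "B"],
                         "year": [1964, 1964],
                         "gdp": [1000.0, 2000.0]})

# indicator=True adds a '_merge' column telling where each row came from
merged = pd.merge(co2_long, gdp_long, on=["country", "year"],
                  how="left", indicator=True)

# Rows present only in the left table have no GDP and will be dropped
n_unmatched = int((merged["_merge"] == "left_only").sum())
clean = merged.drop(columns="_merge").dropna()
print(n_unmatched, len(clean))
```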


#### 1.2 Running Bokeh

Import Bokeh and functions

In [7]:
from bokeh.io import curdoc, output_notebook
from bokeh.plotting import figure, show
from bokeh.models import HoverTool, ColumnDataSource, CategoricalColorMapper, Slider
from bokeh.palettes import Spectral6

In [8]:
from bokeh.layouts import Column, row

#### Prepare base static plot

In [9]:
# load BokehJS - enables the plot to be displayed within the notebook
output_notebook()

In [10]:
# create list of regions - to color the datapoints based on the region
regions_list = data.region.unique().tolist()
# assign colors to each region
color_mapper = CategoricalColorMapper(factors=regions_list, palette=Spectral6)
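The mapper pairs each factor with the palette color at the same position, and `Spectral6` holds six colors. A minimal sketch with a hypothetical subset of three regions makes the pairing visible; if a dataset had more regions than palette entries, a longer palette would presumably be needed.

```python
from bokeh.models import CategoricalColorMapper
from bokeh.palettes import Spectral6

# Hypothetical subset of regions, used only for illustration
regions = ["America", "Europe & Central Asia", "South Asia"]

# Take as many colors as there are regions
palette = list(Spectral6[:len(regions)])
mapper = CategoricalColorMapper(factors=regions, palette=palette)

# Factors and colors are matched by position
lookup = dict(zip(mapper.factors, mapper.palette))
for region, color in lookup.items():
    print(region, color)
```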

In [11]:
# make a data source for the plot
source = ColumnDataSource(data={
    'x': data.gdp[data['year'] == 1964],
    'y': data.co2[data['year'] == 1964],
    'country': data.country[data['year'] == 1964],
    'region': data.region[data['year'] == 1964],
})

In [12]:
# Save the minimum and maximum values of the gdp column: xmin, xmax
xmin, xmax = min(data.gdp), max(data.gdp)

# Save the minimum and maximum values of the co2 column: ymin, ymax
ymin, ymax = min(data.co2), max(data.co2)

In [13]:
# Create the figure: plot
plot = figure(title='CO2 Emissions vs GDP in 1964', 
              height=600, width=1000,
              x_range=(xmin, xmax),
              y_range=(ymin, ymax), y_axis_type='log')

In [14]:
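# Add circle markers to the plot (scatter() draws circles by default;
# recent Bokeh releases deprecate circle() when used with size)
plot.scatter(x='x', y='y', fill_alpha=0.8, source=source, legend_field='region',
             color=dict(field='region', transform=color_mapper),
             size=7)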

In [15]:
# Set the legend.location attribute of the plot
plot.legend.location = 'bottom_right'

# Set the x-axis label
plot.xaxis.axis_label = 'Income Per Person'

# Set the y-axis label
plot.yaxis.axis_label = 'CO2 Emissions (tons per person)'
show(plot)

#### Adding a hover tool

In [16]:
# Create a HoverTool - lets the user hover over a datapoint to see the name of the country, its CO2 emissions and GDP
hover = HoverTool(tooltips=[('Country', '@country'), ('GDP', '@x'), ('CO2 Emission', '@y')])

# Add the HoverTool to the plot
plot.add_tools(hover)

show(plot)

#### Adding a slider to the static plot

In [17]:
# Make a slider object: slider
# Set the start as the 1st year and the end as the last year in the year column
# Set step as 1 and the value as the minimum value of the year column
slider = Slider(start=min(data.year), end=max(data.year), step=1, value=min(data.year), title='Year')

# Create the function that will update the plot every time the slider is moved
def update_plot(attr, old, new):
    # Read the selected year from the slider, then rebuild the data for that year
    yr = slider.value

    new_data = {
        'x': data.gdp[data['year'] == yr],
        'y': data.co2[data['year'] == yr],
        'country': data.country[data['year'] == yr],
        'region': data.region[data['year'] == yr],
    }
    source.data = new_data

    # Add title to figure: plot.title.text
    plot.title.text = 'CO2 Emissions vs GDP in %d' % yr


# Attach the callback to the 'value' property of slider
slider.on_change('value', update_plot)
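The data-selection step inside the callback can be tried on its own, without a Bokeh server. The sketch below is an illustration under assumptions: `frame_for_year` is a hypothetical helper name, and the toy table only mimics the columns of `data`.

```python
import pandas as pd

def frame_for_year(df, yr):
    """Return the column dict that a ColumnDataSource needs for one year."""
    subset = df[df["year"] == yr]  # filter once instead of once per column
    return {
        "x": subset["gdp"].tolist(),
        "y": subset["co2"].tolist(),
        "country": subset["country"].tolist(),
        "region": subset["region"].tolist(),
    }

# Hypothetical toy data with the same columns as 'data'
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "region": ["Europe", "Europe", "Asia", "Asia"],
    "year": [1964, 1965, 1964, 1965],
    "co2": [1.0, 1.5, 2.0, 2.5],
    "gdp": [1000.0, 1100.0, 2000.0, 2100.0],
})

new_data = frame_for_year(toy, 1965)
print(new_data)
```

Inside `update_plot`, the returned dict would be assigned to `source.data`, which is what triggers the redraw.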

In [18]:
layout = row(Column(slider), plot)
curdoc().add_root(layout)

Now open a terminal, change to the folder that contains this Jupyter notebook, type the following command, and wait until the plot is displayed in your web browser:

> bokeh serve --show Interactive_plots.ipynb

### 2. Plotly

`plotly` is a very popular Python module used to create interactive data visualizations. It is a JSON-based plotting tool, and so every plot is defined by 2 JSON objects - data and layout. 

A simplified and more user-friendly version of `plotly` is `plotly express`, which provides a high-level wrapper around the base `plotly` code, resulting in a more compact syntax and fewer commands.

#### Creating an interactive scatter plot

We will use the same dataset created for Bokeh in the previous example.

In [19]:
# Save the minimum and maximum values of the gdp column: xmin, xmax
xmin, xmax = min(data.gdp), max(data.gdp)
# Save the minimum and maximum values of the co2 column: ymin, ymax
ymin, ymax = min(data.co2), max(data.co2)

In [20]:
import pandas as pd
import plotly.express as px

fig = px.scatter(data, x="gdp", y="co2", animation_frame="year", animation_group="country",
           color="region", hover_name="country", facet_col="region", width=1579, height=400,
           log_x=True, size_max=45, range_x=[xmin,xmax], range_y=[ymin,ymax])

fig.show()

In [21]:
# Aggregate in a single plot (by removing facet_col="region") and add rug and boxplot.
fig = px.scatter(data, x="gdp", y="co2", animation_frame="year", 
        color="region", hover_name="country", width=1000, height=600,
        size_max=45, range_x=[xmin,xmax], range_y=[ymin,ymax], 
        marginal_y = 'box', marginal_x = 'rug') # add a boxplot on the right side and a rug plot on top

fig.show()

In [22]:
# Now using a density contour plot instead of a scatter plot
fig = px.density_contour(data, x="gdp", y="co2", animation_frame="year", 
        color="region", hover_name="country", width=1000, height=600,
        range_x=[xmin,xmax], range_y=[ymin,ymax], 
        marginal_y = 'box', marginal_x = 'rug') # add a boxplot on the right side and a rug plot on top

fig.show()

In [23]:
fig = px.scatter(data, x="year", y="co2", color="region", hover_name="country", width=1000, height=500,
           size_max=45, marginal_y = 'box')

fig.show()

#### Visualizing an output of Principal Component Analysis

##### Example using the `Iris` dataset

In [24]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [25]:
# pairwise scatter plots
df = px.data.iris()
features = ["sepal_width", "sepal_length", "petal_width", "petal_length"]

fig = px.scatter_matrix(
    df,
    dimensions=features,
    color="species",
    width=800, height=700
)

fig.update_traces(diagonal_visible=False)
fig.show()

In [26]:
# PCA plots
pca = PCA()
components = pca.fit_transform(df[features])
labels = {
    str(i): f"PC {i+1} ({var:.1f}%)"
    for i, var in enumerate(pca.explained_variance_ratio_ * 100)
}

fig = px.scatter_matrix(
    components,
    labels=labels,
    dimensions=range(4),
    color=df["species"],
    width=800, height=700
)
fig.update_traces(diagonal_visible=False)

fig.show()
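The percentages shown in the axis labels come from `explained_variance_ratio_`. As a complement, the sketch below computes the cumulative share of variance and the number of components needed to reach a threshold. It rests on assumptions not in the original: the data are loaded with `sklearn.datasets.load_iris` instead of `px.data.iris()`, the variables are standardized first, and the 95% threshold is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Four iris measurements, standardized so each variable has unit variance
X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# Fit a PCA that keeps all components
pca = PCA().fit(X_scaled)
ratios = pca.explained_variance_ratio_

# Running total of explained variance across PC 1, PC 2, ...
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative share reaches 95%
n_needed = int(np.argmax(cumulative >= 0.95)) + 1

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC {i}: {r * 100:.1f}% (cumulative {c * 100:.1f}%)")
print("Components needed for 95%:", n_needed)
```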

##### Example using the `winequality` dataset

In [31]:
from sklearn.preprocessing import StandardScaler

df_wine = pd.read_csv('winequality_red.csv')
df_wine2 = df_wine.iloc[:, 0:11]
wine_scaled = StandardScaler().fit_transform(df_wine2)
df_scaled = pd.DataFrame(data=wine_scaled, columns=df_wine2.columns)


pca = PCA()
components = pca.fit_transform(df_scaled)
labels = {
    str(i): f"PC {i+1} ({var:.1f}%)"
    for i, var in enumerate(pca.explained_variance_ratio_ * 100)
}

fig = px.scatter_matrix(
    components,
    labels=labels,
    dimensions=range(5), # Try to change the number of PC's (from 2 to 11, in this case)
    color=df_wine["quality"],
    width=1000, height=900
)
fig.update_traces(diagonal_visible=False)

fig.show()

### 3. Dash

Python's `dash` module offers a framework for building interactive data visualization interfaces. `dash` helps to build interactive web applications and dashboards to visualize data without requiring advanced web development knowledge.

Below you can find an example of a `dash` interactive plot that visualizes PCA plots with a user-defined number of components. To run the app, copy the code into a new file named pca-visualization.py and type the following command into your terminal:

> python pca-visualization.py

Then, go to the http link by using 'ctr... + mouse click'.

To exit: 'Ctrl + C'

#### PCA with `Wine` Dataset

In [32]:
# ATTENTION: THIS CODE DOES NOT WORK IN JUPYTER NOTEBOOK! 
# Copy-paste it into a *.py file and run it from the command line (see above).

# Import modules
# Import Dash and dcc (Dash Core Components; this module includes the dcc.Graph component,
# which renders interactive graphs, and dcc.Slider, which renders an interactive slider).
# We also import sklearn.decomposition.PCA to run a PCA, StandardScaler to standardize the variables,
# the plotly.express library to build the interactive graphs, and pandas to work with DataFrames.

from dash import Dash, dcc, html, Input, Output
from sklearn.decomposition import PCA
import plotly.express as px
import pandas as pd
from sklearn.preprocessing import StandardScaler

df_wine = pd.read_csv('winequality_red.csv')
df_wine2 = df_wine.iloc[:, 0:11]
wine_scaled = StandardScaler().fit_transform(df_wine2)
df = pd.DataFrame(data=wine_scaled, columns=df_wine2.columns)

# Initialize the app
# This line is known as the Dash constructor and is responsible for initializing your app. 
# It is almost always the same for any Dash app you create.
app = Dash(__name__)

# App layout
# The app layout represents the app components that will be displayed in the web browser, 
# normally contained within a html.Div.
app.layout = html.Div([
    html.H4("Visualization of PCA's explained variance"),
    dcc.Graph(id="pca-visualization-x-graph"),
    html.P("Number of components:"),
    dcc.Slider(
        id='pca-visualization-x-slider',
        min=2, max=4, value=2, step=1)
])

# Add controls to build the interaction
# The inputs and outputs of our app are the properties of a particular component. 
# The output is the figure property of the component with the ID "pca-visualization-x-graph"
# The input is the value property of the component that has the ID "pca-visualization-x-slider".
# The callback function's argument 'n_components' refers to the component property of the input. 
# We build PCA plots inside the callback function, assigning the chosen value in the slider. 
# This means that every time the user selects the number of components with the slider, the figure is rebuilt
# to show more or fewer components.
# Finally, we return the scatter plots at the end of the function. 
# This assigns the plots to the figure property of the dcc.Graph, thus displaying the figure in the app.
@app.callback(
    Output(component_id="pca-visualization-x-graph", component_property="figure"), 
    Input(component_id="pca-visualization-x-slider", component_property="value"))

def run_and_plot(n_components):
    pca = PCA(n_components=n_components) # defines the number of components in the PCA
    components = pca.fit_transform(df) # fits a PCA
    var = pca.explained_variance_ratio_.sum() * 100 # total % of variance explained by the retained PCs
    labels = {str(i): f"PC {i+1}" for i in range(n_components)} # PC labels
    labels['color'] = 'quality'
    fig = px.scatter_matrix(
        components,
        color=df_wine['quality'], # color the points by wine quality
        dimensions=range(n_components),
        labels=labels,
        title=f'Total Explained Variance: {var:.2f}%')
    fig.update_traces(diagonal_visible=False)
    return fig

# Run the app - These lines are for running your app, and they are almost always the same for any Dash app you create.
if __name__ == "__main__":
    app.run(debug=True) # in older Dash releases: app.run_server(debug=True)


#### PCA with `Iris` Dataset

In [None]:
# ATTENTION: THIS CODE DOES NOT WORK IN JUPYTER NOTEBOOK! 
# Copy-paste it into a *.py file and run it from the command line (see above).

# Import modules
# Import Dash and dcc (Dash Core Components; this module includes the dcc.Graph component,
# which renders interactive graphs, and dcc.Slider, which renders an interactive slider).
# We also import sklearn.decomposition.PCA to run a PCA, the plotly.express library to build the interactive graphs,
# and pandas to work with DataFrames.

from dash import Dash, dcc, html, Input, Output
from sklearn.decomposition import PCA
import plotly.express as px
import pandas as pd

# Initialize the app
# This line is known as the Dash constructor and is responsible for initializing your app. 
# It is almost always the same for any Dash app you create.
app = Dash(__name__)

# App layout
# The app layout represents the app components that will be displayed in the web browser, 
# normally contained within a html.Div.
app.layout = html.Div([
    html.H4("Visualization of PCA's explained variance"),
    dcc.Graph(id="pca-visualization-x-graph"),
    html.P("Number of components:"),
    dcc.Slider(
        id='pca-visualization-x-slider',
        min=2, max=4, value=2, step=1)
])

# Add controls to build the interaction
# The inputs and outputs of our app are the properties of a particular component. 
# The output is the figure property of the component with the ID "pca-visualization-x-graph"
# The input is the value property of the component that has the ID "pca-visualization-x-slider".
# The callback function's argument 'n_components' refers to the component property of the input. 
# We build PCA plots inside the callback function, assigning the chosen value in the slider. 
# This means that every time the user selects the number of components with the slider, the figure is rebuilt
# to show more or fewer components.
# Finally, we return the scatter plots at the end of the function. 
# This assigns the plots to the figure property of the dcc.Graph, thus displaying the figure in the app.
@app.callback(
    Output(component_id="pca-visualization-x-graph", component_property="figure"), 
    Input(component_id="pca-visualization-x-slider", component_property="value"))

def run_and_plot(n_components):
    df = px.data.iris().iloc[:,0:4]
    pca = PCA(n_components=n_components) # defines the number of components in the PCA
    components = pca.fit_transform(df) # fits a PCA
    var = pca.explained_variance_ratio_.sum() * 100 # total % of variance explained by the retained PCs
    labels = {str(i): f"PC {i+1}" for i in range(n_components)} # PC labels
    labels['color'] = 'species'
    fig = px.scatter_matrix(
        components,
        color=px.data.iris()["species"],
        dimensions=range(n_components),
        labels=labels,
        title=f'Total Explained Variance: {var:.2f}%')
    fig.update_traces(diagonal_visible=False)
    return fig

# Run the app - These lines are for running your app, and they are almost always the same for any Dash app you create.
if __name__ == "__main__":
    app.run(debug=True) # in older Dash releases: app.run_server(debug=True)

#### Variation of human population per country 1959-2007

In [None]:
from dash import Dash, html, dcc, callback, Output, Input
import plotly.express as px
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/gapminder_unfiltered.csv')

app = Dash(__name__)

app.layout = html.Div([
    html.H1(children='Human Population Size', style={'textAlign':'center'}),
    dcc.Dropdown(df.country.unique(), 'Canada', id='dropdown-selection'), # dropdown menu
    dcc.Graph(id='graph-content') 
])

@callback(
    Output('graph-content', 'figure'),
    Input('dropdown-selection', 'value')
)
def update_graph(value):
    dff = df[df.country==value]
    return px.line(dff, x='year', y='pop') # Line graph

if __name__ == '__main__':
    app.run(debug=True) # in older Dash releases: app.run_server(debug=True)

## References

Dash Python User Guide. https://dash.plotly.com/

Dash in 20 Minutes. https://dash.plotly.com/tutorial

Interactive Data Visualization with Python. https://github.com/TrainingByPackt/Interactive-Data-Visualization-with-Python 

PCA Visualization in Python. https://plotly.com/python/pca-visualization/

Plotly Express in Python. https://plotly.com/python/plotly-express/

Plotly Open Source Graphing Library for Python. https://plotly.com/python/

3 Cool Features of Python Altair. https://towardsdatascience.com/3-cool-features-of-python-altair-deb3f432cc11