# Introduction



## Exploratory data analysis

Exploratory data analysis is an activity where you... *explore your data*. It's often conducted towards the beginning of a data science or analysis workflow and is an interactive process where you build up your familiarity with the data; identify its structure and patterns; spot noise, errors, and missing values; and begin to formulate research questions and hypotheses.

Chapter 7 of R for Data Science <a href="https://r4ds.had.co.nz/exploratory-data-analysis.html" target="_blank">(Wickham and Grolemund, 2017)</a> provides an excellent overview of techniques for exploratory data analysis. The authors suggest that two questions should guide your initial exploration of datasets:

* What type of variation occurs within variables?
* What type of covariation occurs between variables?

A variable is a property or feature of interest that can be measured, and a value is the state of the variable when it was measured. Columns in a `DataFrame` or `GeoDataFrame`, or bands in a raster, often correspond to variables; cells in a table or pixels in a raster correspond to values for an observation.


## Setup

### Run the labs

You can run the labs locally on your machine or you can use cloud environments provided by Google Colab or Binderhub. **If you're working with Google Colab or Binderhub, be aware that your sessions are temporary and you'll need to take care to save, back up, and download your work.**

<a href="https://colab.research.google.com/github/data-analysis-3300-3003/colab/blob/main/lab-3-eda.ipynb" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

<a href="https://mybinder.org/v2/gh/binder-3300-3003/binder/HEAD" target="_blank">
  <img src="https://mybinder.org/badge_logo.svg" alt="Open In Binder"/>
</a>

### Download data

If you need to download the data for this lab, run the following code snippet. 

In [None]:
import os

if "week-3" not in os.listdir(os.getcwd()):
    os.system('wget "https://github.com/data-analysis-3300-3003/data/raw/main/data/week-3.zip"')
    os.system('unzip "week-3.zip"')

### Working in Colab

If you're working in Google Colab, you'll need to run the following code snippet to install the required packages that don't come with the Colab environment.

In [None]:
!pip install geopandas
!pip install pyarrow
!pip install mapclassify
!pip install rasterio
!pip install libpysal
!pip install esda
!pip install splot

### Import modules

In [None]:
# Import modules
import os
import pandas as pd
import geopandas as gpd
import plotly.express as px
import numpy as np
import matplotlib.pyplot as plt
import rasterio
import plotly.io as pio
import esda

from splot.esda import plot_moran
from libpysal.weights import KNN, lag_spatial
pio.renderers.default = "colab"

In [None]:
# Load the crop yield data
crop_yield_data_path = os.path.join(os.getcwd(), "week-3")

# Get a list of crop yield data
crop_yield_data_files = os.listdir(crop_yield_data_path)

# Combine the geojson files into one GeoDataFrame
dfs = []

for i in crop_yield_data_files:
    if i.endswith(".geojson"):
        print(f"Loading file {i} into a Geopandas GeoDataFrame")
        tmp_df = gpd.read_file(os.path.join(crop_yield_data_path, i))
        dfs.append(tmp_df)

gdf = pd.concat(dfs, axis=0)

## Data summaries

An initial data exploration task is to produce summary statistics for the variables in our datasets. Pandas `DataFrame`s and GeoPandas `GeoDataFrame`s have a `describe()` method which generates a `DataFrame` of summary statistics for each variable. 

In [None]:
# Describe our DataFrame of crop yield data
gdf.describe()

`describe()` returns a count of the number of observations for each variable, the mean value of each variable, and summary statistics describing the distribution and range of values: the standard deviation, percentiles (the median is the 50th percentile), and the min and max values.

However, there are two main groups in our dataset: canola observations and wheat observations, denoted by the `Variety` column. It will be more informative to generate summary statistics for each group separately. We can do this using the Pandas `groupby()` method, which splits a `DataFrame` into subsets based upon a grouping variable, computes statistics for each subset, and then combines the results. Here, we need to `groupby()` `Variety` to generate summary statistics for each crop type.
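To make the split, apply, combine idea concrete, here is a minimal sketch on a small made-up table. The `toy` `DataFrame` and its values are hypothetical and are not part of the lab data.

```python
import pandas as pd

# Hypothetical toy data: two crop types with three yield observations each
toy = pd.DataFrame({
    "Variety": ["canola", "canola", "canola", "wheat", "wheat", "wheat"],
    "DryYield": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Split the rows by Variety, apply mean() to each subset,
# and combine the results into one Series indexed by Variety
group_means = toy.groupby("Variety")["DryYield"].mean()
print(group_means)
```

Each group's mean is computed only from the rows that belong to that group.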

We'll also generate these summary statistics within a context manager (denoted by a `with` block). This context allows us to change the default display values for a `DataFrame` only for this context without affecting the global defaults that apply to the rest of the notebook. This is a useful trick in case you have a particular need to control how a `DataFrame` is displayed (e.g. printing all rows as specified by the `display.max_rows` option).

In [None]:
with pd.option_context('display.max_rows', None, 'display.float_format', lambda x: '%.3f' % x):
    display(gdf.groupby(["Variety"]).describe())

This layout still isn't very helpful for viewing the summary statistics, as not all of the statistics can be displayed. Let's select the columns we're interested in and transpose the summary statistics, so rows become columns and vice versa, using the `T` transpose operator.

In [None]:
with pd.option_context('display.max_rows', None, 'display.float_format', lambda x: '%.3f' % x):
    display(gdf.loc[:, ["Variety", "DryYield", "gndvi", "ndyi"]].groupby(["Variety"]).describe().T)
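If you want to convince yourself that the `with` block used above only changes display options temporarily, a small check along these lines may help. The variable names are illustrative only.

```python
import pandas as pd

# Record the global default before entering the context
default_max_rows = pd.get_option("display.max_rows")

with pd.option_context("display.max_rows", None):
    # Inside the block the option is temporarily overridden
    inside_max_rows = pd.get_option("display.max_rows")

# After the block the global default is restored
after_max_rows = pd.get_option("display.max_rows")
print(default_max_rows, inside_max_rows, after_max_rows)
```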

## Data distributions

The mean tells us the average value for a variable. However, it is susceptible to outliers and extreme values. Therefore, it is important to view the mean and the median (50th percentile) together as the median is not affected by extreme values. 
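As a quick illustration of how differently the mean and median respond to an extreme value, consider this sketch. The yield values are made up for the example.

```python
import numpy as np

# Hypothetical yield values; the appended value is an implausible outlier
yields = np.array([2.1, 2.4, 2.5, 2.6, 2.9])
yields_with_outlier = np.append(yields, 250.0)

print(f"Without outlier: mean={yields.mean():.2f}, median={np.median(yields):.2f}")
print(f"With outlier:    mean={yields_with_outlier.mean():.2f}, "
      f"median={np.median(yields_with_outlier):.2f}")
```

A single extreme value drags the mean a long way from the bulk of the data, while the median barely moves.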

However, neither the mean nor the median reveals the spread or distribution of values for a variable. The min and max values tell us the range of values for a variable. This can be useful for detecting potential measurement error and noise (e.g. is the max value for wheat yield sensible?). However, the min and max values can be affected by extreme values and don't tell us anything about the shape or density of the distribution of data values.

The inter-quartile range (difference between the 75th and 25th percentile values) tells us how spread out the data is around the median and the standard deviation tells us how spread out the data is around the mean. Assuming a normal distribution, ~68% of the values are within one standard deviation of the mean. Thus, if the standard deviation is small relative to the mean it indicates there is not much spread in the data away from the mean.
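The sketch below computes the inter-quartile range and standard deviation for simulated data and checks the "roughly 68% within one standard deviation" rule of thumb. The data are randomly generated from a normal distribution with an assumed mean and spread; they are not the lab data.

```python
import numpy as np

# Simulated, normally distributed values (assumed mean 3.0, sd 0.5)
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=3.0, scale=0.5, size=100_000)

# Inter-quartile range: spread around the median
q25, q75 = np.percentile(values, [25, 75])
iqr = q75 - q25

# Standard deviation: spread around the mean
mean, sd = values.mean(), values.std()

# Share of values within one standard deviation of the mean
within_one_sd = np.mean(np.abs(values - mean) < sd)
print(f"IQR={iqr:.3f}, sd={sd:.3f}, share within 1 sd={within_one_sd:.3f}")
```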

It is often useful to visualise the distribution of variables. A histogram is a common visualisation for distributions. The height of each bar of a histogram corresponds to the count of values that fall within the bin. The width of the bar corresponds to the bin width.

In [None]:
fig = px.histogram(
    data_frame=gdf, 
    x="DryYield", 
    facet_col="Variety", 
    hover_data=["DryYield", "Elevation", "WetMass"])
fig.show()

The choice of bin width will affect the histogram. Too small a bin width will lead to small peaks in the distribution of data values being visualised which can obscure the dominant pattern in the data. Conversely, too large a bin width could mask important parts of a variable's distribution. The `histogram()` function from Plotly Express has a `nbins` parameter that can be used to specify the number of bins.
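To see the effect of bin choice without drawing a figure, the following sketch bins the same simulated values coarsely and finely. It uses NumPy's `histogram()` function as a stand-in for the plotting function, and the data are randomly generated for illustration.

```python
import numpy as np

# Simulated values standing in for a yield variable
rng = np.random.default_rng(seed=1)
values = rng.normal(loc=3.0, scale=0.5, size=500)

# Very few bins: the shape of the distribution is largely hidden
counts_coarse, edges_coarse = np.histogram(values, bins=3)

# Many bins: each bin holds few observations, so counts look noisy
counts_fine, edges_fine = np.histogram(values, bins=100)

print("3 bins:", counts_coarse)
print("100 bins, first 10 counts:", counts_fine[:10])
```

Both histograms describe the same 500 observations; only the binning differs.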

### Outliers

The majority of data values on the histograms above are concentrated on the far left of the figure. If you zoom in you will see there are a few isolated extreme or outlier yield values, which are masking the dominant pattern of the distribution. Detecting outliers is an important part of exploratory data analysis. 

A common way to detect outliers is to use a threshold based on percentile or standard deviation values. Here, we'll treat any value that is more than three standard deviations above or below the mean as an outlier. Once outliers have been detected, we need to fix or remove them; we'll start by removing them.


In [None]:
# Canola
df_canola = gdf.loc[gdf["Variety"] == "43Y23 RR", :]
print(f"There are {df_canola.shape[0]} canola rows BEFORE dropping outliers")
df_canola = df_canola.loc[(df_canola["DryYield"]-df_canola["DryYield"].mean()).abs() < (3*df_canola["DryYield"].std()), :]
print(f"There are {df_canola.shape[0]} canola rows AFTER dropping outliers")

# Wheat
df_wheat = gdf.loc[gdf["Variety"] == "Ninja", :]
print(f"There are {df_wheat.shape[0]} wheat rows BEFORE dropping outliers")
df_wheat = df_wheat.loc[(df_wheat["DryYield"]-df_wheat["DryYield"].mean()).abs() < (3*df_wheat["DryYield"].std()), :]
print(f"There are {df_wheat.shape[0]} wheat rows AFTER dropping outliers")

In [None]:
# combine filtered dfs
gdf_clean = pd.concat([df_canola, df_wheat], axis=0)

In [None]:
fig = px.histogram(
    data_frame=gdf_clean, 
    x="DryYield", 
    facet_col="Variety", 
    marginal="box", 
    hover_data=["DryYield", "Elevation", "WetMass"])
fig.show()
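The cells above use a standard deviation threshold. As a sketch of the percentile-based alternative mentioned earlier, the example below flags values that lie more than 1.5 inter-quartile ranges outside the 25th and 75th percentiles (Tukey's rule). The 1.5 multiplier and the `toy` data are assumptions chosen for illustration, not values taken from the lab.

```python
import pandas as pd

# Hypothetical yield values with one extreme observation
toy = pd.DataFrame({"DryYield": [2.1, 2.3, 2.4, 2.5, 2.6, 2.7, 2.9, 45.0]})

# Percentile-based fences around the middle 50% of the data
q1 = toy["DryYield"].quantile(0.25)
q3 = toy["DryYield"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag and drop values outside the fences
is_outlier = (toy["DryYield"] < lower) | (toy["DryYield"] > upper)
toy_clean = toy.loc[~is_outlier, :]
print(f"Dropped {is_outlier.sum()} of {toy.shape[0]} rows")
```

Because percentiles are not pulled around by extreme values, this kind of threshold can be more robust than one based on the mean and standard deviation when a dataset is small or the outliers are very large.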

Instead of dropping rows where there are extreme crop yield values, we can also replace outlier values with a more sensible value such as the mean.  

In [None]:
mean_yield = gdf.loc[:, ["Variety", "DryYield"]].groupby(["Variety"]).mean()
mean_yield

In [None]:
# Replace outliers with the mean yield for each variety
# .copy() ensures we modify a new DataFrame and not a view of gdf
df_canola = gdf.loc[gdf["Variety"] == "43Y23 RR", :].copy()
df_canola.loc[(df_canola["DryYield"]-df_canola["DryYield"].mean()).abs() > (3*df_canola["DryYield"].std()), "DryYield"] = mean_yield.loc["43Y23 RR", "DryYield"]

df_wheat = gdf.loc[gdf["Variety"] == "Ninja", :].copy()
df_wheat.loc[(df_wheat["DryYield"]-df_wheat["DryYield"].mean()).abs() > (3*df_wheat["DryYield"].std()), "DryYield"] = mean_yield.loc["Ninja", "DryYield"]

# combine filtered dfs
gdf_replaced = pd.concat([df_canola, df_wheat], axis=0)

In [None]:
fig = px.histogram(
    data_frame=gdf_replaced, 
    x="DryYield", 
    facet_col="Variety", 
    marginal="box", 
    hover_data=["DryYield", "Elevation", "WetMass"])
fig.show()

Looking at the wheat yield histogram we can see there are a large number of zero or close to zero values. This is another strange artefact in the distribution of our data values. Are zero crop yield values actually no crop yield from the plant or a source of measurement error or other noise? If the latter, we should remove these noisy values. 

In [None]:
# .copy() so that later assignments don't act on a view of gdf_clean
df_canola = gdf_clean.loc[gdf_clean["Variety"] == "43Y23 RR", :].copy()
print(f"The number of canola observations with yield values of zero or less is: {(df_canola['DryYield'] <= 0).sum()}")
df_wheat = gdf_clean.loc[gdf_clean["Variety"] == "Ninja", :].copy()
print(f"The number of wheat observations with yield values of zero or less is: {(df_wheat['DryYield'] <= 0).sum()}")

In [None]:
df_canola.loc[df_canola["DryYield"] <= 0, "DryYield"] = np.nan
df_wheat.loc[df_wheat["DryYield"] <= 0, "DryYield"] = np.nan

# combine filtered dfs
gdf_with_nan = pd.concat([df_canola, df_wheat], axis=0)

# drop NAs
gdf_dropped_nan = gdf_with_nan.dropna(subset=["DryYield"])

In [None]:
fig = px.histogram(
    data_frame=gdf_dropped_nan, 
    x="DryYield", 
    facet_col="Variety", 
    marginal="box", 
    hover_data=["DryYield", "Elevation", "WetMass"])
fig.show()

Now that we've removed the zero values, the distribution of our crop yield values looks more sensible and is closer to a normal distribution.

### 2D histograms

We can use 2D histograms or density heatmaps to look at the distribution of two variables together. 2D histograms are a useful complement to scatter plots when you have a large number of observations. Here, colour is used to represent the distribution of data values as opposed to the height of rectangular bars on a histogram.

Let's create 2D histograms to visualise the relationship between vegetation indices and canola crop yield.

In [None]:
fig = px.density_heatmap(
    data_frame=gdf_dropped_nan.loc[gdf_dropped_nan["Variety"] == "43Y23 RR", :], 
    x="DryYield", 
    y="gndvi", 
    marginal_x="box", 
    marginal_y="box",
    range_y=[0.4, 0.8])
fig.show()

In [None]:
fig = px.density_heatmap(
    data_frame=gdf_dropped_nan.loc[gdf_dropped_nan["Variety"] == "43Y23 RR", :], 
    x="DryYield", 
    y="ndyi", 
    marginal_x="box", 
    marginal_y="box",
    range_y=[0.1, 0.5])
fig.show()

### Violin plots

One of the limits of using histograms to visualise distributions is that the choice of bin size affects how the distribution of data values appears. An alternative approach to visualising a distribution is to use a violin plot.

Violin plots combine density curves and box plots to visualise distributions, and the resulting shapes look similar to violins. The density describes how likely an observation is to take on a certain value and is plotted as a smooth curve. Areas where the curve is fatter indicate a higher probability that an observation will take a value in that range. Box plots display the 25th, 50th, and 75th percentile values.

In [None]:
fig = px.violin(
    gdf_dropped_nan, 
    y="DryYield", 
    x="Variety", 
    color="Variety", 
    box=True, 
    points="outliers", 
    hover_data=["DryYield", "gndvi"])
fig.show()
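The smooth density curve in a violin plot is typically a kernel density estimate. The sketch below builds one by hand with NumPy to show the idea: a small Gaussian "bump" is placed on every observation and the bumps are averaged. The simulated data and the rule-of-thumb bandwidth are assumptions for illustration; plotting libraries choose their own bandwidths, so this will not exactly reproduce the curve in the figure above.

```python
import numpy as np

# Simulated values standing in for a yield variable
rng = np.random.default_rng(seed=2)
values = rng.normal(loc=3.0, scale=0.5, size=200)

# Assumed bandwidth: a normal-reference rule of thumb
bandwidth = 1.06 * values.std(ddof=1) * len(values) ** (-1 / 5)

# Grid of points at which to evaluate the density
grid = np.linspace(values.min() - 3 * bandwidth, values.max() + 3 * bandwidth, 400)

# Place a Gaussian kernel on every observation and average them
z = (grid[:, None] - values[None, :]) / bandwidth
density = np.exp(-0.5 * z**2).sum(axis=1) / (len(values) * bandwidth * np.sqrt(2 * np.pi))

# A density curve should enclose an area of approximately one
area = density.sum() * (grid[1] - grid[0])
print(f"Area under the density curve: {area:.3f}")
```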

## Covariation and correlation

Whereas variation describes the distribution of values within a variable, covariation describes how values vary together between two variables. Positive covariance indicates that as the values of one variable increase so do the values of the other variable (or they both decrease together). Negative covariance indicates that as values of one variable increase the values of the other variable decrease.

We can use scatter plots to explore the covariance in our data. We explored this above, using 2D histograms, when we looked at the relationships between NDYI and GNDVI and crop yield.

We can compute the covariance between two variables as:

$$Cov(X,Y)=\frac{1}{N}\sum_{i=1}^{N}(X_{i}-\bar{X})(Y_{i}-\bar{Y})$$

However, a limit of using the covariance to measure association and relationships between variables is that it is affected by the units of measurement. For example, if we had the same crop yield measurements but in units of tonnes per hectare or kilograms per hectare we'd get different covariance scores. Therefore, the correlation coefficient is often used as a measure of association between variables.

$$Corr(X,Y) = \frac{Cov(X,Y)}{sd(X) \cdot sd(Y)} = \frac{\sigma_{XY}}{\sigma_{X}\sigma_{Y}}$$

The correlation coefficient is bounded between -1 and 1, with -1 indicating perfect negative correlation and 1 indicating perfect positive correlation.
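The following sketch implements the covariance and correlation formulas above directly and shows the effect of changing units. The yield and GNDVI values are made up, and the helper functions `covariance()` and `correlation()` are written for this example only.

```python
import numpy as np

# Hypothetical paired observations
yield_t_ha = np.array([1.8, 2.2, 2.5, 3.1, 3.4])   # tonnes per hectare
gndvi = np.array([0.52, 0.58, 0.61, 0.70, 0.72])
yield_kg_ha = yield_t_ha * 1000                     # same yields in kg per hectare


def covariance(x, y):
    """Mean of the products of deviations from the means (the 1/N form)."""
    return np.mean((x - x.mean()) * (y - y.mean()))


def correlation(x, y):
    """Covariance divided by the product of the standard deviations."""
    return covariance(x, y) / (x.std() * y.std())


cov_t = covariance(yield_t_ha, gndvi)
cov_kg = covariance(yield_kg_ha, gndvi)
corr_t = correlation(yield_t_ha, gndvi)
corr_kg = correlation(yield_kg_ha, gndvi)

print(f"Covariance:  t/ha={cov_t:.4f}, kg/ha={cov_kg:.4f}")
print(f"Correlation: t/ha={corr_t:.4f}, kg/ha={corr_kg:.4f}")
```

Converting tonnes to kilograms multiplies the covariance by 1000 but leaves the correlation coefficient unchanged.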

Pandas has a <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html" target="_blank">`corr`</a> method that can be used to compute correlation coefficients between columns in a `DataFrame`. Let's compute the correlation coefficients between crop yields and NDYI and GNDVI.

In [None]:
gdf_corr = gdf_dropped_nan.loc[:, ["Variety", "DryYield", "gndvi", "ndyi"]].groupby(["Variety"]).corr()
gdf_corr

## Spatial correlation

Spatial correlation or spatial autocorrelation describes how similar values are for observations that are close to each other in space. Spatial autocorrelation is the degree to which similar values cluster together in space and is the "absence of spatial randomness" <a href="https://geographicdata.science/book/notebooks/06_spatial_autocorrelation.html" target="_blank">(Rey et al. 2020)</a>. Spatial randomness occurs when the values of observations have no relationship with their location in space.

Some key terms related to spatial autocorrelation:

* *positive spatial autocorrelation*: the values of nearby locations are similar (e.g. a location and its neighbours both have high or low values).
* *negative spatial autocorrelation*: similar values are located far away from each other (less common than positive spatial autocorrelation).
* *global spatial autocorrelation*: a description or summary of the general trend of spatial autocorrelation in the dataset.
* *local spatial autocorrelation*: a local measure of how similar an observation's values are to its neighbours, which is useful for identifying clusters of similar or dissimilar values in a dataset.

We can visually explore the spatial correlation in our data by plotting it on a map and using a suitable colour scale to map data values to colours. Let's visualise the canola crop yield values.

In [None]:
df_canola = gdf_dropped_nan.loc[gdf_dropped_nan["Variety"] == "43Y23 RR", :].copy()

fig = px.scatter_mapbox(
    df_canola, 
    lat=df_canola.geometry.y,
    lon=df_canola.geometry.x,
    color="DryYield", 
    mapbox_style="open-street-map")
fig.show()

Visually, we can see some spatial patterns in our canola yield data. High yield values with yellow shades appear to be clustered together in the field. 

While we can get a general sense of the degree of spatial autocorrelation in our dataset by viewing the data on a map, this does not provide us with a formal method to quantify how spatially correlated the dataset is or how likely this level of spatial correlation is under a context of spatial randomness.

The global Moran's I statistic can be used as a measure of spatial autocorrelation in the dataset and for undertaking statistical inference to assess how likely realising the observed pattern of spatial correlation is under a context of spatial randomness. This is analogous to statistical significance / hypothesis testing. 

The global Moran's I statistic is a measure of the correlation between an observation's value and the values of its neighbours. The average value of an observation's neighbours is termed its spatial lag.

A scatter plot can help us build up our intuition of the Moran's I statistic. First, let's compute each canola yield observation's spatial lag (i.e. the average crop yield of its four nearest neighbours here; there are a range of methods for defining neighbouring values).

We do this by generating a spatial weights matrix. This is an object that keeps a record of which data points are neighbours of a given observation in our `GeoDataFrame`. Here, we're defining neighbours using the K nearest neighbour rule; specifically the four nearest neighbours.

Here, we row-standardise the weights in the spatial weights matrix so each neighbour contributes equal weight and the sum of an observation's neighbours' weights is one. This has the useful effect that the sum of each neighbour's weight multiplied by its value returns the average value for all neighbours of an observation: its spatial lag.

In [None]:
# create a spatial weights matrix to identify a location's neighbouring values
w_knn = KNN.from_dataframe(df_canola, k=4)
# Row-standardization
w_knn.transform = "R"
print(f"the neighbours of the first observation in our dataset are {w_knn.neighbors[0]}")

# compute spatial lag of canola yield
df_canola["DryYield_SpatialLag"] = lag_spatial(w_knn, df_canola["DryYield"])
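To see what the spatial weights matrix and spatial lag are doing, here is a hand-rolled NumPy sketch on five hypothetical points with `k=2` neighbours. It is an illustration of the idea only, not how `libpysal` builds its weights internally, and the coordinates and values are made up.

```python
import numpy as np

# Hypothetical point locations (x, y) and yield values
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [10.0, 0.0]])
values = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
k = 2

# Pairwise distances between all points
dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

# Indices of the k nearest neighbours of each point
neighbours = np.argsort(dists, axis=1)[:, :k]

# Binary weights matrix: 1 where j is a neighbour of i, 0 otherwise
W = np.zeros_like(dists)
W[np.arange(len(values))[:, None], neighbours] = 1.0

# Row-standardise so each row of weights sums to one
W = W / W.sum(axis=1, keepdims=True)

# Spatial lag: weighted sum of neighbouring values, i.e. the neighbours' average
spatial_lag = W @ values
print(spatial_lag)
```

Because each row of weights sums to one, every entry of `spatial_lag` is simply the mean of that point's two nearest neighbours.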

The Moran's plot is a scatter plot that visualises the variable of interest, crop yield here, against its spatial lag. This scatter plot reveals the pattern of spatial autocorrelation in the dataset. If we see a trend of values from the bottom left to top right corners of the graph it indicates there is positive spatial autocorrelation, a trend of values from top left to bottom right indicates negative spatial autocorrelation, and no visible trend indicates spatial randomness. 

The Moran's plot uses standardised values; here, we standardise by subtracting the global mean of a variable from each observation (mean-centring).

With row-standardised weights, the slope of the trend line on the scatter plot is the Moran's I statistic; it indicates the sign and strength of spatial autocorrelation in the dataset.

In [None]:
# standardise yield measurements
df_canola["DryYield_std"] = df_canola["DryYield"] - df_canola["DryYield"].mean()
df_canola["DryYield_SpatialLag_std"] = df_canola["DryYield_SpatialLag"] - df_canola["DryYield_SpatialLag"].mean()

# morans plot
fig = px.scatter(
    df_canola,
    x = "DryYield_std",
    y = "DryYield_SpatialLag_std",
    trendline = "ols",
    opacity=0.05,
    labels={"DryYield_std": "Standardised Canola Yield",
           "DryYield_SpatialLag_std": "Standardised Canola Yield (spatial lag)"}
)
fig.add_hline(y=0, line_width=0.5)
fig.add_vline(x=0, line_width=0.5)
fig.show()
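The link between the trend line and Moran's I can be checked numerically. The sketch below uses a hypothetical one-dimensional transect where each location's neighbours are the adjacent locations, with simulated values; it computes Moran's I from its formula and compares it with the slope of a least-squares line fitted to the mean-centred values and their spatial lag.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
n = 50

# Hypothetical transect: neighbours are the locations immediately either side
W = np.eye(n, k=1) + np.eye(n, k=-1)
W = W / W.sum(axis=1, keepdims=True)  # row-standardise

# Simulated values that vary smoothly along the transect (a random walk)
values = np.cumsum(rng.normal(size=n))

# Mean-centred values and their spatial lag
z = values - values.mean()
lag = W @ values

# Moran's I: (n / sum of weights) * (z' W z) / (z' z)
morans_i = (n / W.sum()) * (z @ W @ z) / (z @ z)

# Slope of the least-squares line through the Moran's plot
slope, intercept = np.polyfit(z, lag - lag.mean(), deg=1)

print(f"Moran's I: {morans_i:.4f}, trend line slope: {slope:.4f}")
```

The two numbers agree because, with row-standardised weights, the sum of all weights equals the number of observations and the formula reduces to the regression slope.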

We can also compute the Moran's I statistic which is a statistical summary of the general level of spatial autocorrelation in the dataset.

In [None]:
moran = esda.moran.Moran(df_canola["DryYield"], w_knn)
print(f"Moran's I statistic is: {round(moran.I, 3)}")
print(f"P-value for Moran's I statistic is: {moran.p_sim}")

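This p-value is a pseudo p-value: it estimates the probability of obtaining a Moran's I statistic at least as extreme as the one observed if yield values were arranged randomly in space, based on 999 random permutations of the data (the `esda` default). A small p-value, as obtained here, indicates that we can reject a null hypothesis of canola yield values being arranged in a spatially random manner in the field.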
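The idea behind this kind of simulation-based inference can be sketched with NumPy alone. The example below shuffles simulated values across locations 999 times, recomputes Moran's I each time, and counts how often a shuffled value is at least as large as the observed one. The transect, the values, and the helper function `morans_i()` are hypothetical, and this one-sided calculation is a simplified illustration and not a reproduction of the exact procedure in `esda`.

```python
import numpy as np


def morans_i(values, W):
    """Global Moran's I for values and a spatial weights matrix W."""
    z = values - values.mean()
    return (len(values) / W.sum()) * (z @ W @ z) / (z @ z)


rng = np.random.default_rng(seed=4)
n = 50

# Hypothetical transect with row-standardised adjacent-neighbour weights
W = np.eye(n, k=1) + np.eye(n, k=-1)
W = W / W.sum(axis=1, keepdims=True)

# Simulated, spatially smooth values (a random walk)
values = np.cumsum(rng.normal(size=n))
observed = morans_i(values, W)

# Spatial randomness: shuffle values across locations and recompute Moran's I
n_permutations = 999
simulated = np.array(
    [morans_i(rng.permutation(values), W) for _ in range(n_permutations)]
)

# Pseudo p-value: share of statistics (including the observed one)
# that are at least as large as the observed statistic
p_sim = (np.sum(simulated >= observed) + 1) / (n_permutations + 1)
print(f"Observed Moran's I: {observed:.3f}, pseudo p-value: {p_sim:.3f}")
```

With 999 permutations the smallest pseudo p-value that can be obtained is 0.001, which is worth keeping in mind when interpreting very small values.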