```r
library(tidyverse)
library(vegan)
```




# Modern ecological problems

**This is a course about community ecology and not so much about population ecology.** Community ecology underpins the vast fields of biodiversity and biogeography, and concerns spatial scales from a few square meters to the whole of Earth. We can look at the historical, contemporary, and future processes that have been implicated in shaping the distribution of life on our planet.

Community ecologists tend to analyse how multiple environmental factors act as drivers that influence the distribution of tens or hundreds of species. These data are often messy (not in the sense of untidy data as per the 'tidyverse' definition of tidy data, although they can be that too!), and statistical considerations need to be understood within the context of the data available to us. The messiness takes the form of measurement errors, extreme values, the presence of a few very rare or very abundant species, autocorrelated residuals (due to repeated sampling, for example), collinearity, etc.

These challenges make the application of 'basic' statistical approaches problematic, and a different branch of inferential and exploratory statistics needs to be followed. Its techniques allow us to work with all the data at once, and because they can simultaneously analyse all the variables (multiple environmental drivers acting on multiple species at multiple places and across multiple times), this group of methods is called 'multivariate statistics.' There are two main groups of multivariate statistics: 'classifications' and 'ordinations.' Classification generally concerns placing samples (species or environments) into groups or hierarchies of groups, while ordination is best suited for analyses that involve arranging samples along gradients. Often they complement each other, but we shall see later that each approach has its own strengths. Irrespective of the analysis, the data share a few characteristics.
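To make the distinction between classification and ordination concrete, here is a minimal base-R sketch on a made-up environmental table. The variable names (`temp`, `chl_a`, `depth`) and all values are invented for illustration, and hierarchical clustering and principal component analysis are only one possible choice from each family of methods.

```r
# Toy data: 6 samples x 3 environmental variables (all values are invented)
env <- data.frame(
  temp  = c(12, 13, 14, 20, 21, 22),
  chl_a = c(5.0, 4.6, 4.8, 1.2, 1.0, 1.4),
  depth = c(5, 6, 7, 15, 16, 17),
  row.names = paste0("site", 1:6)
)

# Classification: place samples into groups.
# Standardise first so that no variable dominates because of its units.
d      <- dist(scale(env))              # Euclidean distances between samples
cl     <- hclust(d, method = "average") # hierarchical (agglomerative) clustering
groups <- cutree(cl, k = 2)             # cut the tree into two groups

# Ordination: arrange samples along gradients.
ord    <- prcomp(env, scale. = TRUE)    # principal component analysis
scores <- ord$x[, 1:2]                  # sample positions on the first two axes

groups
scores
```

Classification returns a group label per sample, whereas ordination returns coordinates along a few axes; the two views are often read side by side.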

## Biodiversity: patterns and processes

### Background: the IUCN definition of biodiversity

### Alpha-, beta-, and gamma-diversity

### Whittaker's concept of beta-diversity, and contemporary interpretations

### The relationship between alpha- and beta-diversity

### Historical, neutral, and niche theories

### Species assembly: turnover and nestedness-resultant beta diversity

### Global change (climate change etc.)

## Where do ecological data come from?

Information is all around us. It existed long before sentient humans began questioning "Life, the Universe and Everything." In those early times, before the scientific age made the necessary tools and ways of thinking about problems available, humans often invented silly answers. Today we have access to the ways and means to question the world around us, and we may arrive at objective answers (which are hopefully no longer silly). This module concerns large amounts of quantitative data (information turned into numbers) about our world. These quantitative data have been collected over many hundreds of years (give examples of long data sets), and they continue to be collected at increasing rates, over increasing spatial scales, and at finer and finer resolution. Moreover, advanced deterministic general circulation models offer a predictive capability; when these are coupled with an ecophysiological understanding of how plants, animals, and things that are neither plants nor animals react to environmental stimuli, we may project how the biota may respond in the future.

Let us consider some of the sources of information (data) that we will be able to analyse and turn into knowledge using the tools available to the quantitative ecologist.

### Field sampling

### Historical data

### Remotely sensed data

### Modelled data (projections)

## Properties of the data sets
Ecological data sets are usually arranged in a *matrix*, which **has species (or higher level taxa, whose resolution depends on the research question at hand) arranged as columns** and **samples (typically the sites, stations, transects, time, plots, etc.) as rows**. We call this a **sites × species** table. In the case of environmental data it will of course be a **sites × environment** table. The term 'sample' may be applied differently compared to how we used it in the [Basic Statistics Workshop](https://robwschlegel.github.io/Intro_R_Workshop/); here we use it to denote the basic unit of observation. Samples may be quadrats, transects, stations, sites, traps, seine net tows, trawls, grid cells on a map, etc. It is important to clearly and unambiguously define the basic unit of the sample in the paper's Methods section.
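As a small illustration of this layout, the sketch below turns a made-up set of field records (one row per observation) into a sites × species matrix using base R. The data frame `records` and its column names are hypothetical; in a tidyverse workflow a pivoting function would probably be used instead.

```r
# Hypothetical long-format records: one row per observation (invented values)
records <- data.frame(
  site    = c("A", "A", "B", "C", "C", "C"),
  species = c("sp1", "sp2", "sp1", "sp2", "sp3", "sp3"),
  count   = c(4, 1, 2, 7, 3, 2)
)

# Sum the counts for every site/species combination:
# samples (sites) become rows and species become columns
spp <- tapply(records$count,
              list(site = records$site, species = records$species),
              sum)

# Combinations that were never recorded come back as NA; code them as 0
spp[is.na(spp)] <- 0

spp
```

Each row of `spp` is now one sample and each column one species, which is the orientation assumed throughout the module.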

Example species and environmental data sets are displayed below in Figures 2.1-2.4. The species matrix here comprises distribution records of 846 macroalgal species within each of 58 coastal sections, each 50 km long, along South Africa's coastline. So, the matrix has 58 rows, one for each sample (here each of the coastal sections), and 846 columns, one for each of the seaweed species found in South Africa. Where a species is absent from a coastal section, the entry is simply coded as 0 (not present in the case of presence/absence data, or 0 units of biomass or abundance, etc.). The matching environmental data set contains various measurements of seawater temperature and chlorophyll-*a* content; their names are along the columns, and there are 18 of them. It is important that a sample of the environment is available for each of the seaweed samples, so there will also be 58 rows in this data set. It is a matching data set in the sense that each sample of species data is matched by a sample of the environment (both have 58 rows). Using these data, it was the intention of Smit et al. (2017) to describe the gradients in seaweed distribution as a function of the underlying seawater temperatures.
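A quick check that the two tables really match can be sketched as follows. The objects `spp` and `env` are tiny invented stand-ins (4 samples rather than 58), and the row names are an assumption about how samples are identified; the real data may use a separate identifier column.

```r
# Invented stand-in for the species table: 4 samples x 3 species
spp <- matrix(c(1, 0, 0,
                1, 1, 0,
                0, 1, 1,
                0, 0, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("section", 1:4), c("sp1", "sp2", "sp3")))

# Invented stand-in for the environmental table, with rows deliberately shuffled
env <- data.frame(
  temp_mean = c(13.1, 15.4, 18.9, 22.3),
  chl_a     = c(4.2, 3.1, 1.8, 0.9),
  row.names = paste0("section", c(3, 1, 4, 2))
)

# Put the environmental rows in the same order as the species rows
env <- env[rownames(spp), , drop = FALSE]

# Stop with an error if the samples do not line up one-to-one
stopifnot(nrow(spp) == nrow(env),
          identical(rownames(spp), rownames(env)))
```

Running a check like this before any analysis guards against silently pairing a species sample with the wrong environmental sample.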

Species data may be recorded as various kinds of measurements, such as presence/absence data, biomass, or abundance. 'Presence/absence' simply tells us whether the species is there or not; it is binary. 'Abundance' generally refers to the number of individuals per unit of area or volume, or to percent cover. 'Biomass' refers to the mass of the species per unit of area or volume. The type of measure will depend on the taxa and the questions under consideration. The important thing to note is that all species have to be homogeneous in terms of the metric used (i.e. all of them as presence/absence, or abundance, or biomass, not mixtures of them).

The matrix's constituent row vectors are considered the species composition for the corresponding sample. That is, a row runs across multiple columns, and this tells us that the sample is composed of the species whose names are given by the column titles. Note that in the data in Figures 2.1-2.2 there are many 0s, meaning that not all species are present at every site. Species composition is frequently expressed in terms of relative abundance, i.e. constrained to a constant total such as 1 or 100%.

The environmental data may be heterogeneous, i.e. the units of measure may differ among the variables. For example, pH has no units, the concentration of some nutrient has a unit of (typically) μM, elevation may be in meters, etc. Because these units differ so much, and the variables therefore have different magnitudes and ranges, we may need to standardise them. The purpose of multivariate analysis is to find patterns in these complex sets of data, and to explain why these patterns are present.
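The three transformations mentioned above can be sketched in base R as follows. All values and variable names are invented for illustration. The vegan package loaded at the top offers `decostand()` for the same purpose (methods `"pa"`, `"total"`, and `"standardize"`), but base R is used here to keep each step explicit.

```r
# Invented abundance table: 3 samples x 3 species
abund <- matrix(c(10, 0, 5,
                   0, 2, 2,
                   3, 3, 0),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("s1", "s2", "s3"), c("sp1", "sp2", "sp3")))

# Presence/absence: 1 where a species occurs, 0 where it does not
pa <- (abund > 0) * 1

# Relative abundance: divide each row by its total so that every row sums to 1
rel <- sweep(abund, 1, rowSums(abund), "/")

# Invented environmental table with variables on very different scales
env <- data.frame(pH          = c(7.9, 8.1, 8.3),
                  nitrate_uM  = c(2, 10, 18),
                  elevation_m = c(5, 120, 40))

# Standardise: centre each column on 0 and scale it to unit standard deviation
env_std <- scale(env)

pa
rel
env_std
```

After standardisation every environmental variable has mean 0 and standard deviation 1, so no variable dominates an analysis merely because of its units.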

Many community data matrices share some general characteristics:

* Most species occur only infrequently. The majority of species is typically represented at only a few locations, and these species contribute little to the overall abundance. This results in sparse matrices, as we see in Figures 2.1-2.2, where the bulk of the entries consists of zeros.

* Ecologists tend to sample a multitude of factors that they think influence species composition, so the matching environmental data set will also have multiple (often tens of) columns that will be assessed in various hypotheses about the drivers of species patterning across the landscape. For example, fynbos biomass may be influenced by the fire regime, elevation, aspect, soil moisture, soil chemistry, edaphic features, etc.

* Even though we may capture a multitude of information about many environmental factors, the number of important ones is generally quite low --- i.e. a few factors can explain the majority of the explainable variation, and it is our intention to find out which of them is most important.

* Much of the signal may be spurious. Variability is a general characteristic of the data, and this may result in false patterns emerging. This is so because our sampling may capture a huge amount of stochasticity (processes that are entirely non-deterministic), which may mask the real pattern of interest. Imaginative and creative sampling may reveal some of the patterns we are after, but this requires long years of experience and is not something that can easily be taught as part of our module.

* There is a huge amount of collinearity. Many explanatory variables are correlated with one another, so although several of them may appear to explain the patterning, only a few act in a way that implies causation. Collinearity is something we will return to later on.
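Two of these characteristics, sparsity and collinearity, are easy to inspect directly. The sketch below uses invented toy tables; the variable names (`temp_mean`, `temp_max`, `chl_a`) are hypothetical and only mimic the kind of temperature and chlorophyll-*a* variables described earlier.

```r
# Invented species table: 4 samples x 5 species, mostly zeros
spp <- matrix(c(3, 0, 0, 0, 1,
                5, 0, 0, 2, 0,
                2, 1, 0, 0, 0,
                4, 0, 1, 0, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("s", 1:4), paste0("sp", 1:5)))

# Sparsity: the proportion of entries that are zero
sparsity <- mean(spp == 0)

# Occupancy: the number of samples in which each species occurs
occupancy <- colSums(spp > 0)

# Invented environmental table in which two variables carry
# almost the same information
env <- data.frame(temp_mean = c(13, 15, 18, 22),
                  temp_max  = c(16, 18.5, 21, 25.5),
                  chl_a     = c(4.0, 3.2, 1.9, 1.0))

# Collinearity: pairwise Pearson correlations among explanatory variables
env_cor <- cor(env)

sparsity
occupancy
round(env_cor, 2)
```

Correlations close to 1 or -1 between explanatory variables, as between the two temperature variables here, are a sign that they cannot be interpreted independently of each other.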

## What do we do with these data?

We follow the principles of reproducible research, and throughout we will implement and practice modern data analytical methods. These approaches concern i) data entry, ii) data management, iii) data-wrangling, iv) analysis, and v) reporting, and we shall discuss each under the next headings.

### Initial data entry

### Meta-data and data management

### Data-wrangling: pre-processing and quality assurance

### Analysis

### Reporting

## This module

In the practical component, we will apply quantitative ecological methods to study the patterning of species along gradients, and to classify landscapes (based on plant and animal assemblages) into clusters using some measures of (dis-)similarity.
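As a preview of what a measure of dissimilarity looks like in practice, the sketch below computes the Bray-Curtis dissimilarity between every pair of samples in a tiny invented abundance table and then clusters the samples. Bray-Curtis is only one of many possible measures and is chosen here as an assumption, not because the module prescribes it. The helper `bray_curtis()` is a hypothetical function written for illustration; vegan's `vegdist()` provides this and other indices.

```r
# Invented abundance table: 3 samples x 3 species
abund <- matrix(c(10, 0, 5,
                   8, 1, 4,
                   0, 6, 0),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("s1", "s2", "s3"), c("sp1", "sp2", "sp3")))

# Hypothetical helper: Bray-Curtis dissimilarity between two samples
# (0 = identical composition, 1 = no species shared)
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Fill a samples x samples matrix with all pairwise dissimilarities
n <- nrow(abund)
d <- matrix(0, n, n, dimnames = list(rownames(abund), rownames(abund)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- bray_curtis(abund[i, ], abund[j, ])
  }
}
d <- as.dist(d)

# Cluster the samples on the dissimilarities and cut the tree into two groups
cl     <- hclust(d, method = "average")
groups <- cutree(cl, k = 2)

d
groups
```

Samples `s1` and `s2` share most of their species and end up in the same cluster, while `s3`, which shares only one species with `s2` and none with `s1`, forms its own.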