# 2. Still Descriptive...

# The outliers....

![title](Outlier.jpg)

An outlier is a value that lies far from the bulk of the data, in either direction.

Outliers can be highly damaging, as they can drive a particular analysis one way or the other.

![title](bad_outlier.gif)

In multivariate analyses such as PCA (which we will cover soon), outliers tend to dominate the first principal components, and in some circumstances they can drive the analysis toward the opposite conclusion.

However, outliers can also carry substantial analytical value: they can point to interesting behaviors of the data, be of biological relevance, or reveal relevant flaws in a particular design.

### !!!Do Not Remove Outliers Before Investigating What They Represent!!!

## How to detect outliers!!
Graphical tools tend to work best for detecting outliers. Let's look at boxplots.

### Boxplots

The boxplot shows the center of the data (the median) and its spread (the box spans the 25% and 75% quartiles, so its height is the interquartile range, IQR). Beyond the box, statistical software usually draws lines (whiskers) extending up and down to the most extreme observations that lie within 1.5 times the IQR of the box edges. Points or circles beyond these lines represent observations that fall outside all of these spread measurements.
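
To make the whisker rule concrete, here is a minimal sketch that computes the box edges and the 1.5 × IQR fences by hand. The values in `alt` are made up for illustration (they are not the Irish pH data), and the variable names are assumptions of this sketch. It uses the hinges from `fivenum`, which is what base R's `boxplot.stats` relies on.

```r
# Hypothetical altitude-like values; 900 is deliberately extreme
alt <- c(50, 80, 95, 110, 120, 135, 150, 160, 180, 210, 900)

hinges <- fivenum(alt)[c(2, 4)]   # lower and upper hinges (edges of the box)
iqr    <- diff(hinges)            # height of the box
fences <- c(hinges[1] - 1.5 * iqr, hinges[2] + 1.5 * iqr)

# Observations beyond the fences are the points drawn past the whiskers
flagged <- alt[alt < fences[1] | alt > fences[2]]
flagged

# Base R reports the same points
boxplot.stats(alt)$out
```

A point flagged this way is only a candidate outlier; whether it is an error or a real observation still has to be investigated.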



#### Let's look at an example.

This dataset was used in a study conducted by Cruikshanks et al. (2006), whose main goal was to identify acid-sensitive waters in coastal rivers in Ireland. They modeled pH as a function of SDI (Sodium Dominance Index), the altitude of the site, and the presence or absence of forest. Let's look at the boxplot of altitude.

In [None]:
IrishpH <- read.table(file = "IrishpH.txt",
                      header = TRUE,
                      dec = ".")
library(car)
par(mar = c(5,5,2,2), cex.lab = 1.5)
Boxplot(IrishpH$Altitude, ylab = "Altitude")
stripchart(IrishpH$Altitude, 
           vertical = TRUE, method = "jitter", 
           pch = 21,add = TRUE,col=rgb(1, 0, 0,0.5)) 

In [None]:
library(mosaic)
favstats(IrishpH$Altitude)

We should also look at all of the variables that will be part of our model, to get a good idea of how they behave. Let’s use another dataset, in this case the data used by Ligas (2008). That study looked at the effect of month and sex on the cephalothorax length of the red swamp crayfish *Procambarus clarkii*, using multiple variables (weight, sex, month, and sexual maturity) measured on 746 crayfish individuals.

We can construct a **conditional boxplot** that shows how cephalothorax length changes across months.

In [None]:
Crayfish <- read.table(file = "Procambarus.txt",
                         header = TRUE,
                         dec = ".")

In [None]:
head(Crayfish)

In [None]:
#install.packages("car")
library(car)
Boxplot(CTL ~ Month,
        ylab = "Cephalothorax Length",
        xlab = "Month", 
        data= Crayfish,
        main = expression(italic("Procambarus clarkii")))

stripchart(CTL ~ Month,data = Crayfish, 
           vertical = TRUE, method = "jitter", 
           pch = 21, 
           add = TRUE,col=rgb(1, 0, 0,0.5)) 

Let’s stop here for a moment and review an important extra piece of information that these boxplots also give us.

When we compare multiple groups in a parametric statistical test (where normality is assumed), one of the main conditions is **homogeneity of variance (called homoscedasticity)**: the spread of the response values is the same for every value of the covariate.

For example, looking at the crayfish conditional boxplot, we see that most of the classes of our variable have a similar pattern of spread, except for the second class (Mar_05), where the variance is much smaller and the distribution seems skewed. A quick look at the spread of the points suggests that this month was sparsely sampled, which may be skewing the distribution.

There are multiple statistical tests that allow us to test for homogeneity of variances, such as the Bartlett test, the F-ratio test, and Levene’s test, among others.
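
As a rough sketch of how two of these tests can be run in base R, the example below uses simulated data (not the crayfish data); the data frame `sim` and its columns are assumptions made up for illustration. Levene's test is not in base R; it is available as `leveneTest` in the car package.

```r
set.seed(42)
# Hypothetical lengths for three months; the third month is far more variable
sim <- data.frame(
  length = c(rnorm(40, mean = 45, sd = 3),
             rnorm(40, mean = 46, sd = 3),
             rnorm(40, mean = 45, sd = 9)),
  month  = factor(rep(c("m1", "m2", "m3"), each = 40))
)

# Bartlett test: the null hypothesis is equal variances across all groups
bt <- bartlett.test(length ~ month, data = sim)
bt$p.value

# F-ratio test: compares the variances of exactly two groups
two <- droplevels(subset(sim, month %in% c("m1", "m3")))
ft <- var.test(length ~ month, data = two)
ft$p.value
```

A small p-value suggests that the variances differ between groups. Both tests are sensitive to departures from normality, so they are best read alongside the boxplots rather than instead of them.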

#### Another useful visualization technique is the violin plot. It is similar to a box plot, with a rotated kernel density plot added on each side.

Continuing with the crayfish dataset, let's construct violin plots for the same data we evaluated before.

In [None]:
library(ggplot2)
ggplot(data = Crayfish, aes(x = Month, y = CTL)) +
  geom_boxplot(alpha = 0.2) +
  geom_violin(fill='red', color='red',  alpha=0.4) +
  geom_jitter(alpha = 0.6, color = "black") + 
  theme_bw()

It is important to note here that there are a lot of missing values. Missing values can also affect the behavior of the data and our results.

We need to understand whether the missing values have a biological basis, are an artifact of the sampling, or are simply clerical errors.

We can count the number of missing values with the function is.na in R.

How can we deal with zeros in the data?

In [None]:
sum(is.na(Crayfish$CTL))
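
The same idea extends to a whole data frame. The sketch below uses a small made-up data frame (`toy` is a hypothetical stand-in, not the crayfish data) to count missing values per column, count incomplete rows, and count recorded zeros separately from missing values.

```r
# Hypothetical measurements: NA marks a missing value, 0 is a recorded zero
toy <- data.frame(
  CTL    = c(41.2, NA, 38.5, 0, 44.1, NA),
  Weight = c(20.1, 18.3, NA, 0, 25.7, 22.4)
)

colSums(is.na(toy))               # missing values per column
sum(!complete.cases(toy))         # rows with at least one missing value
colSums(toy == 0, na.rm = TRUE)   # zeros per column, ignoring NAs

clean <- na.omit(toy)             # keep only the complete rows
nrow(clean)
```

Dropping incomplete rows with `na.omit` is only one option, and it is reasonable only once we understand why the values are missing.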

## Cleveland dotplots

Another interesting way of looking at the data is the Cleveland dotplot, in which we plot the row number of each observation against its observed value: the y-axis shows the order of the data and the x-axis shows the values.

Let’s look at the Irish pH dataset again.

In [None]:
par(mar = c(5,5,2,2), cex.lab = 1.5, cex.main = 1.5)
dotchart(IrishpH$Altitude,
         main = "Altitude",
         ylab = "Order of the data",
         xlab = "Range of the data")
           

In [None]:
head(IrishpH)

In [None]:
ggplot(IrishpH, aes(ID,Altitude)) +
  geom_point(stat = "identity") +
  geom_text(data=subset(IrishpH, Altitude > 400),
            aes(ID,Altitude,label=Altitude),hjust = -0.2)+
  coord_flip()

### What to do if you suspect that there are outliers in your data?

1. If, after investigating them, you are sure they are errors, remove them.
2. Run the models with and without the outliers, and present both analyses.
3. Apply a transformation.
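
As an illustration of the second option, the sketch below fits the same linear model with and without a suspicious observation and compares the coefficients. The data frame `dat` is made up for this example: it follows y = 2x closely, except for the last row.

```r
# Hypothetical data: y is close to 2 * x, except for the last observation
dat <- data.frame(
  x = 1:10,
  y = 2 * (1:10) + c(0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.3, -0.3, 0.1)
)
dat$y[10] <- 60   # the suspected outlier

fit_all  <- lm(y ~ x, data = dat)          # all observations
fit_drop <- lm(y ~ x, data = dat[-10, ])   # suspected outlier left out

# Side-by-side coefficients
rbind(with_outlier    = coef(fit_all),
      without_outlier = coef(fit_drop))
```

If the two fits lead to the same conclusion, the outlier matters little; if they disagree, both results should be reported.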

### Transformations

Transformations change the dispersion of the data. Because the same transformation is applied to every observation, no individual observation is singled out, so the transformation itself does not bias the data.

There are multiple types of transformation (see [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3043340/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3043340/) for a complete review). The three most used are logarithmic, square root, and reciprocal.
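
A minimal sketch of these three transformations is shown below, applied to a made-up vector `v` with one extreme value. The helper `extremeness` is a hypothetical measure introduced only for this illustration: the distance of the largest value from the median, in units of the IQR.

```r
# Hypothetical positive measurements with one extreme value
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 200)

transformed <- data.frame(
  raw        = v,
  log        = log(v),    # natural logarithm (requires v > 0)
  sqrt       = sqrt(v),   # square root (requires v >= 0)
  reciprocal = 1 / v      # reciprocal (requires v != 0; reverses the ordering)
)

# How far the largest value sits from the median, in units of the IQR
extremeness <- function(z) (max(z) - median(z)) / IQR(z)
ext <- sapply(transformed[c("raw", "sqrt", "log")], extremeness)
round(ext, 1)
```

In this example the square root pulls the extreme value in somewhat and the logarithm pulls it in more strongly. The reciprocal is left out of the comparison because it reverses the order of the values.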

We can check the example from the homework.

In [None]:
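# Read the plant height data used in the homework
bimodalData_s <- read.csv(file = "plant_heights.csv", header = TRUE)

# Natural log and base-10 log of the heights
bimodalData_s$log <- log(bimodalData_s$x)
bimodalData_s$log10 <- log10(bimodalData_s$x)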




In [None]:
a = runif(1000)
plot(a)
hist(a)



In [None]:
plot(bimodalData_s$x, seq_along(bimodalData_s$x))  # value (x-axis) against row number (y-axis)

In [None]:
d= density(bimodalData_s$x)
par(mfrow = c(1,2))
plot(d)

qqnorm(bimodalData_s$x)

In [None]:
plot(bimodalData_s$log10, seq_along(bimodalData_s$log10))  # same plot after the log10 transformation

## Histograms

As we have seen previously, histograms are useful when we want to check for normality (which is important if we want to apply some statistical tests). The histogram aims to show the center and distribution of the data.

For the log-transformed plant heights from the previous section:

In [None]:
par(mfrow = c(1,3))
hist(bimodalData_s$log)

d= density(bimodalData_s$log10)
plot(d)

qqnorm(log(bimodalData_s$x,10))

To create a histogram we first create a frequency table.

In [None]:
x = round(rnorm(1000,0,1),1)


In [None]:
table(x)

In [None]:
hist(x)
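
To see the link between the frequency table and the histogram, the sketch below asks `hist` for its bins without drawing them and then rebuilds the same counts with `cut` and `table`. The vector `vals` is simulated in the same way as above; the names are assumptions of this sketch.

```r
set.seed(10)
vals <- round(rnorm(1000, mean = 0, sd = 1), 1)   # simulated values, rounded to one decimal

h <- hist(vals, plot = FALSE)   # let hist() choose the bins without drawing
h$breaks                        # bin edges
h$counts                        # bar heights

# The same counts from an explicit frequency table of binned values
bins <- cut(vals, breaks = h$breaks, include.lowest = TRUE, right = TRUE)
freq <- table(bins)
freq
```

The histogram is therefore a frequency table of binned values drawn as bars, which is why the choice of bins changes its appearance.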

In [None]:
# Let's run another example: load the sparrows dataset and select one species
# Then plot the histogram of weights

Sparrows = read.table(file = "Sparrows.txt", header = TRUE)
Sparrows2 = Sparrows[Sparrows$Species == "SSTS",]
hist(Sparrows2$Wt, xlab = "Weight in grams", main = expression(italic("Ammodramus caudacutus")))

### We can also draw multiple histograms using the lattice package

In [None]:
Sparrows2

In [None]:
library(lattice)
histogram(~Wt|factor(Observer),
         data = Sparrows2,
         layout = c(1,7),
         nint = 30,
         xlab = "Weight in grams",
         strip = FALSE,
         strip.left = TRUE,
         ylab = "Frequencies")

## Going back to our initial example, we can change the size of the bins to make the histogram look more like a continuous distribution

In [None]:
hist(x, breaks = 40)

In [None]:
hist(x, breaks = 400) ## but we lose resolution

#### However, with this shape, it is still a little difficult to assess whether the raw data follow a normal distribution. In order to solve this problem we can use Kernel Density Curves

## Kernel Density Curves

A kernel density estimate is a non-parametric way to calculate an empirical probability density function (PDF) of a random variable X. It uses a smoothing parameter that affects the shape of the curve.

As explained by Ieno and Zuur (2015), the kernel defines small functions, one centered on each observation, that are added up to form a smooth curve.
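
The sketch below builds such a curve by hand, assuming a Gaussian kernel: one normal density is centered on each observation and the densities are averaged at every point of a grid. The data `obs` and the bandwidth `bw` are made up for illustration. The dashed red line from `density` is drawn for comparison; `density` uses a faster approximation, so the two curves should agree closely but not exactly.

```r
set.seed(1)
obs <- rnorm(50, mean = 20, sd = 2)   # hypothetical weights in grams
bw  <- 0.8                            # bandwidth chosen by hand for illustration

# Grid of points at which the curve is evaluated
grid <- seq(min(obs) - 4 * bw, max(obs) + 4 * bw, length.out = 400)

# One Gaussian bump per observation, evaluated at each grid point, then averaged
kde_manual <- sapply(grid, function(g) mean(dnorm(g, mean = obs, sd = bw)))

plot(grid, kde_manual, type = "l", lwd = 2,
     xlab = "Weight (in grams)", ylab = "Density")
lines(density(obs, bw = bw), lty = 2, col = "red")
rug(obs)   # tick marks showing where the observations are
```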

In [None]:
d = density(Sparrows$Wt)

In [None]:
plot(d)

In [None]:
Sparrows$fSpecies <- factor(Sparrows$Species,
                            levels = c("SESP","SSTS"),
                            labels = c("A.maritimus",
                                       "A.caudacutus"))
                                  
par(mar = c(5,5,2,2), cex.lab = 1.5)


plot(d,
     xlab = "Weights (in grams)",
     cex.lab = 1.5,
     cex.main = 1.5,
     main = "",
     xlim = c(15,28),
     ylim = c(0, 0.35),
     lwd = 5)

d1 <- density(Sparrows$Wt[Sparrows$fSpecies == "A.maritimus"])
d2 <- density(Sparrows$Wt[Sparrows$fSpecies == "A.caudacutus"])

lines(d1, lty = 2, lwd = 2)
lines(d2, lty = 3, lwd = 2)

legend("topright",
        legend = expression("Both Species",
                        italic("A. maritimus"),
                        italic("A. caudacutus")),
        lty = c(1, 2, 3),
        lwd = c(5, 2, 2))

### How does the smoothing function work?

#### In the density function the smoothing parameter is called bw (bandwidth). It can be any positive number, expressed in the units of the data; larger values give smoother curves

In [None]:
d_ex = density(x, bw = 0.01)
plot(d_ex)

In [None]:
bw_ex = c(0.1,0.5,1)
col2 = c("red","blue","forestgreen")
plot(d_ex)
for (i in 1:3){
    d_ex = density(x, bw = bw_ex[i])
    lines(d_ex, col = col2[i])
}
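
When `bw` is not given, `density` chooses a bandwidth from the data. The sketch below, on simulated values (`w` is made up for illustration), shows that the default corresponds to the rule of thumb implemented in `bw.nrd0`, and that the `adjust` argument rescales it.

```r
set.seed(3)
w <- rnorm(200, mean = 20, sd = 2)   # hypothetical weights

d_default <- density(w)              # bandwidth chosen automatically
d_default$bw
bw.nrd0(w)                           # the rule of thumb behind the default

d_wide <- density(w, adjust = 2)     # twice the default bandwidth
d_wide$bw
```

Using `adjust` is a convenient way to try smoother or rougher curves without having to pick an absolute bandwidth.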

## Another way to graph the distribution of a continuous variable is using a quantile-quantile plot (Q-Q plot)

### Q-Q plots

The basic idea is to plot the quantiles of two distributions against each other. We then draw a straight line, and if the points fall approximately along the line we can infer that the distributions are similar.

We can plot the quantiles of the raw data against the quantiles of any theoretical distribution and check how similar the distributions are.

Following our initial example:

In [None]:
qqnorm(x)
qqline(x)
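
To show what `qqnorm` does, the sketch below rebuilds a normal Q-Q plot by hand: the sorted data are plotted against the standard normal quantiles at the plotting positions given by `ppoints`. The sample `s` is simulated for illustration.

```r
set.seed(7)
s <- rnorm(100, mean = 5, sd = 2)   # hypothetical sample

n           <- length(s)
probs       <- ppoints(n)           # plotting positions between 0 and 1
theoretical <- qnorm(probs)         # standard normal quantiles
observed    <- sort(s)              # sample quantiles

plot(theoretical, observed,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
qqline(s)   # reference line through the first and third quartiles
```

Replacing `qnorm` with the quantile function of another distribution (for example `qexp`) compares the data against that distribution instead.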

### In the sparrow example, we can also compare the distribution of weights for the two species with Q-Q plots from the lattice package

In [None]:
qqmath (~Wt|fSpecies,
        data = Sparrows,
        cex = 1, col = 1,
        ylab = list(label = "Weight (in grams)", cex = 1.5),
        xlab = list(label = "Theoretical Quantiles", cex = 1.5),
       prepanel = prepanel.qqmathline,
       panel = function(x, ...) {
          panel.qqmathline(x, ...)
          panel.qqmath(x, ...)
       })