# Descriptive statistics and normality testing

## Imports

In [None]:
import pandas as pd
import scipy.stats as ss
import seaborn as sb
import statsmodels.api as sm

## Data loading

In [None]:
file_name = "../data/toy_descriptive_long.csv"
df = pd.read_csv(file_name)
df.head()

## Tabular analysis

- The `DataFrame` class from `pandas` provides the `describe` method for computing basic descriptive statistics.
- By default, `describe` ignores non-numeric columns.

In [None]:
df.describe()
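To see how `describe` treats non-numeric columns, here is a minimal sketch on made-up data. The `toy` DataFrame and its values are assumptions for illustration, not the course dataset. Passing `include="all"` asks `describe` to summarize every column instead of only the numeric ones.

```python
import pandas as pd

# Hypothetical toy data: one text column and one numeric column.
toy = pd.DataFrame(
    {
        "group": ["S1", "S1", "S2", "S2"],
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)

# Default behaviour: only the numeric column is summarized.
numeric_summary = toy.describe()

# include="all": text columns get count/unique/top/freq rows as well.
full_summary = toy.describe(include="all")

print(numeric_summary.columns.tolist())
print(full_summary.columns.tolist())
```

Rows that do not apply to a column (for example `mean` for a text column) are filled with `NaN` in the combined table.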

The previous code computed the statistics for the whole dataset.
But we really want to compute the values by group.
We can "subset" our data as follows.

In [None]:
df[df["group"] == "S2"]

The previous code gives us the subset of data from group S2.
Now we can run `describe` on that data.

In [None]:
df[df["group"] == "S2"].describe()

If we want to describe for all groups we can use the `groupby` method.

In [None]:
df.groupby(by="group").describe()
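If only a few statistics are needed per group, `agg` can be used in place of `describe`. The sketch below runs on made-up data; the `toy` DataFrame and the `iqr` helper are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical toy data with two groups.
toy = pd.DataFrame(
    {
        "group": ["S1"] * 4 + ["S2"] * 4,
        "value": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
    }
)


def iqr(x):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return x.quantile(0.75) - x.quantile(0.25)


# One row per group, one column per requested statistic.
summary = toy.groupby("group")["value"].agg(["count", "median", iqr])
print(summary)
```

The custom function shows up as a column named after the function, here `iqr`.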

## Basic plots

### Plotting data points

In [None]:
sb.swarmplot(df, x="group", y="value")

### Boxplots of the data distribution

In [None]:
sb.boxplot(
    df,
    x="group",
    y="value",
)

### Interval plots

In [None]:
ax = sb.pointplot(
    df,
    x="group",
    y="value",
)

We need to get rid of the line connecting the groups.

In [None]:
ax = sb.pointplot(
    df,
    x="group",
    y="value",
    linestyles="none",
)

By default we get the mean with a 95% bootstrap confidence interval.
Let's change to the median and IQR.
We will also add some horizontal bars to the ends of the range.

In [None]:
ax = sb.pointplot(
    df,
    x="group",
    y="value",
    capsize=0.4,
    estimator="median",
    errorbar=("pi", 50),
    linestyles="none",
)
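The `("pi", 50)` setting requests a percentile interval covering the middle 50% of the data, which runs from the 25th to the 75th percentile. As a rough check of what is being drawn, the same numbers can be computed with NumPy. The `values` array is made up for illustration, and the calculation is a sketch of the idea rather than a copy of seaborn's internals.

```python
import numpy as np

# Hypothetical measurements for a single group.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

# A percentile interval of width 50 spans the 25th to 75th percentiles.
width = 50
lower, upper = np.percentile(values, [50 - width / 2, 50 + width / 2])
median = np.median(values)

print(f"median={median}, interval=({lower}, {upper})")
```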

## Combining plots

We can combine different types of plots.
Let's start by adding the data points to our boxplot.

In [None]:
ax = sb.swarmplot(
    df,
    x="group",
    y="value",
    color="k",
)
ax = sb.boxplot(
    df,
    ax=ax,
    x="group",
    y="value",
    fill=False,
    color="k",
)
ax.set_xlabel("Study group")
ax.set_ylabel("Weight (g)")

We can do the same thing with our interval plots.

In [None]:
ax = sb.swarmplot(df, x="group", y="value", color="k")
ax = sb.pointplot(
    df,
    ax=ax,
    x="group",
    y="value",
    capsize=0.4,
    color="k",
    estimator="median",
    errorbar=("pi", 50),
    linestyles="none",
)

### Histograms

Pandas DataFrames can plot histograms with the built-in `hist` method.

In [None]:
df.hist(by="group")

The default layout is pretty ugly.
We can lay things out in one row and fix the figure size.

In [None]:
df.hist(by="group", figsize=(12, 4), layout=(1, 3))

While `pandas` has basic plotting support, `seaborn` is a lot more powerful.
Let's do the same thing in `seaborn`.

In [None]:
sb.displot(df, x="value", col="group")

The `displot` function is a bit different from what we saw earlier, in that it produces multiple plots.
Instead of returning a single `Axes` object, it returns a collection of `Axes` stored in a new object called a `FacetGrid`.
Let's clean up this plot.

In [None]:
fg = sb.displot(df, x="value", col="group", bins=10, color="k", fill=False)
fg.set_xlabels("Weight (g)")

## Normality testing

We will use the `scipy.stats` module, which provides a large collection of statistical functions.
To compute the Shapiro-Wilk statistic we will use the `shapiro` function.
This function takes a sequence of numbers.

In [None]:
ss.shapiro(df["value"])

Again, this computes the test for the whole dataset.
We would rather do it by group.
So let's use the `groupby` method.

In [None]:
df.groupby(by="group")["value"].apply(ss.shapiro)

We only need the p-value, not the test statistic.
Let's also round to 3 decimals.

In [None]:
df.groupby(by="group")["value"].apply(lambda x: round(ss.shapiro(x)[1], 3))
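If both the statistic and the p-value are of interest, one option is to collect them into a small table with one row per group. This is a sketch on simulated data; the `toy` DataFrame, the seed, and the column names `W` and `p_value` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import scipy.stats as ss

# Simulated data: S1 is drawn from a normal distribution, S2 from a skewed one.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "group": np.repeat(["S1", "S2"], 30),
        "value": np.concatenate(
            [rng.normal(50, 5, 30), rng.exponential(5, 30)]
        ),
    }
)

# Run the test once per group and keep both parts of the result.
records = []
for name, values in toy.groupby("group")["value"]:
    result = ss.shapiro(values)
    records.append(
        {"group": name, "W": result.statistic, "p_value": result.pvalue}
    )

table = pd.DataFrame(records).set_index("group").round(3)
print(table)
```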

To draw QQ plots we will use `statsmodels`.
There is no convenient way to do it by group, so we will subset manually.

In [None]:
sm.qqplot(df.loc[df["group"] == "S2", "value"]);

By default `qqplot` does not draw a reference line.
We will add one with `line="q"`, which fits a line through the quartiles.

In [None]:
sm.qqplot(df.loc[df["group"] == "S2", "value"], line="q");