## Reminder

We came a long way in that first tutorial. A reminder of what we did and where to find it:

- [Getting the data into Python](./tutorial.ipynb#Getting-the-data-into-Python)
    -  using `read_csv` and dealing with missing data
- [Accessing columns](./tutorial.ipynb#Accessing-the-columns)
    -  using dot notation and square brackets
    -  setting the index
    -  using `loc`
- [Sorting and filtering](./tutorial.ipynb#Sorting-and-filtering)
    -  the `sort_values` function
    -  how to get documentation
    -  default arguments
    -  passing a Boolean to `loc[]`
    -  compound filters
- [Summary statistics](./tutorial.ipynb#Summary-statistics)
    -  not so useful for this data set but good to know
- [Investigating relationships](./tutorial.ipynb#Investigating-relationships)
    -  drawing scatter plots in `pandas`
    -  drawing better scatter plots in `seaborn`
    -  getting the correlation coefficient
- [Time series](./tutorial.ipynb#Time-Series)
    -  plotting simple time series
    -  applying a calculation and creating new columns



## Practice

![Blackbird](https://www.rspb.org.uk/globalassets/images/birds-and-wildlife/bird-species-illustrations/blackbird_male_1200x675.jpg?preset=landscape_mobile)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

1. Import the `blackbirds.csv` data into a `pandas` dataframe.
1. How many rows are there in your dataframe? (Try `len`; note that `.size` counts every cell, not just the rows.)
1. Is there a sensible index in the dataframe?
1. What does each of the columns represent? What do you think the age values mean?
1. Find the mean and standard deviation (`std`) of the wing span and weight columns.
1. Use the documentation to check *which* standard deviation you're getting.
1. Use the `quantile` function to find the median and the IQR too.
1. Is there a relationship between wing span and weight? Visualise it and measure it.
1. Use the `hue`, `size`, `style` and `markers` arguments of the `seaborn` [scatterplot function](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) to distinguish between the different kinds of blackbird in your plot.
1. Find the mean and standard deviation of the weight and wing span of adult female and male blackbirds separately.
1. What other questions could you ask of this data set?

### Q1 and Q2

In [None]:
blackbirds = pd.read_csv("blackbirds.csv")
len(blackbirds)
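
It is worth seeing why `len` is used here rather than `.size`. The sketch below uses a made-up three-row frame (the values are invented for illustration and are not taken from `blackbirds.csv`) to compare the three common ways of measuring a dataframe:

```python
import pandas as pd

# Hypothetical toy data: three rows, two columns
toy = pd.DataFrame({"Wing": [128, 131, 125], "Weight": [95.0, 102.5, 98.0]})

n_rows = len(toy)      # number of rows only
dimensions = toy.shape # (rows, columns)
n_cells = toy.size     # rows * columns, i.e. every cell

print(n_rows, dimensions, n_cells)
```

So `.size` only equals the row count for a single column, for example `blackbirds.Weight.size`.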

### Q3

There isn't anything unique in the columns to index by.

### Q4

The values in the `Age` column are `J` (juvenile), `F` (first year), `A` (adult) and `U` (unknown).

### Q5

In [None]:
# Mean and standard deviation of both measurement columns in one table
blackbirds[["Wing", "Weight"]].agg(["mean", "std"])

### Q6

The default `ddof` argument is 1, which means the denominator will be $n-1$, so `std` returns the sample standard deviation by default.
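
A quick way to convince yourself is to compute both versions on a tiny series. This is a minimal sketch with invented numbers; note that NumPy's `np.std` defaults to `ddof=0`, the opposite of `pandas`:

```python
import numpy as np
import pandas as pd

weights = pd.Series([90.0, 100.0, 110.0])  # hypothetical values

sample_sd = weights.std()              # pandas default, ddof=1 (divide by n-1)
population_sd = weights.std(ddof=0)    # divide by n instead
numpy_sd = np.std(weights.to_numpy())  # NumPy default is ddof=0

print(sample_sd, population_sd, numpy_sd)
```

With only three values the two definitions differ noticeably; for a data set with hundreds of rows the difference is usually negligible.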

### Q7

In [None]:
print("Weight: The median is {}, with IQR {}".format(blackbirds.Weight.quantile(0.5),
                                                     blackbirds.Weight.quantile(0.75)-blackbirds.Weight.quantile(0.25)))
print("Wing: The median is {}, with IQR {}".format(blackbirds.Wing.quantile(0.5),
                                                     blackbirds.Wing.quantile(0.75)-blackbirds.Wing.quantile(0.25)))
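
The repeated calls to `quantile` can be avoided by passing a list of probabilities, which returns a Series indexed by those probabilities. A small sketch on invented data:

```python
import pandas as pd

wing = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical values

# One call returns all three quartiles, indexed by the requested probability
quartiles = wing.quantile([0.25, 0.5, 0.75])
median = quartiles.loc[0.5]
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]

print("The median is {}, with IQR {}".format(median, iqr))
```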

But actually, `describe` reports all of these at once:

In [None]:
blackbirds.describe()

The `Year` column shouldn't really be summarised as if it were a measurement. If you check `blackbirds.dtypes` you'll see why it was.

In [None]:
blackbirds.Year = pd.to_datetime(blackbirds.Year,format="%Y")
blackbirds.describe()

Check `blackbirds.dtypes` again.
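
The same check can be sketched on a toy frame. The data here is invented, and the `astype(str)` step is an extra precaution added for this sketch so that the `%Y` format is matched against text:

```python
import pandas as pd

# Hypothetical toy data with years stored as plain integers
toy = pd.DataFrame({"Year": [2001, 2002, 2002], "Weight": [95.0, 102.5, 98.0]})

before = toy.dtypes["Year"]  # an integer dtype, so describe() treats it as a number
toy["Year"] = pd.to_datetime(toy["Year"].astype(str), format="%Y")
after = toy.dtypes["Year"]   # now a datetime dtype

print(before, after)
```

Each year becomes a timestamp for 1 January of that year, and the original integer is still available through `toy["Year"].dt.year`.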

### Q8

In [None]:
blackbirds.plot.scatter("Wing","Weight")

In [None]:
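# Restrict to the numeric measurement columns; pandas 2.0+ raises an error on text columns
blackbirds[["Wing", "Weight"]].corr()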
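
If only one coefficient is wanted, `Series.corr` returns it directly rather than as a matrix. A minimal sketch on invented measurements, cross-checked against NumPy:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from blackbirds.csv
toy = pd.DataFrame({
    "Wing": [125, 128, 130, 133, 135],
    "Weight": [92.0, 99.0, 97.0, 104.0, 108.0],
})

r = toy["Wing"].corr(toy["Weight"])  # Pearson correlation by default
r_numpy = np.corrcoef(toy["Wing"], toy["Weight"])[0, 1]

print(r, r_numpy)
```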

### Q9

In [None]:
plt.figure(figsize=(12,10))

sns.scatterplot(data=blackbirds,x="Wing",y="Weight",hue="Age", style="Sex", palette="Spectral", alpha=0.5)

### Q10

In [None]:
blackbirds.loc[(blackbirds.Sex=='M')&(blackbirds.Age=='A')].describe()

In [None]:
blackbirds.loc[(blackbirds.Sex=='F')&(blackbirds.Age=='A')].describe()

But this feels like a good opportunity to see the `groupby` function:

In [None]:
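# numeric_only=True skips any text columns, which pandas 2.0+ would otherwise reject
blackbirds.groupby(["Age","Sex"]).mean(numeric_only=True)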
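
Since Q10 asks for both the mean and the standard deviation, `agg` can compute several statistics per group in one go. A sketch on an invented frame that borrows the `Age`, `Sex` and `Weight` column names:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Age": ["A", "A", "A", "A", "J", "J"],
    "Sex": ["F", "F", "M", "M", "F", "M"],
    "Weight": [96.0, 100.0, 102.0, 106.0, 90.0, 92.0],
})

# One row per (Age, Sex) pair, one column per statistic
summary = toy.groupby(["Age", "Sex"])["Weight"].agg(["mean", "std"])
adult_female_mean = summary.loc[("A", "F"), "mean"]

print(summary)
```

Groups with a single bird have a mean but no sample standard deviation, so their `std` is `NaN`.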

### Q11

## Boxplots

We can use grouped boxplots to see how weight and wing span change with age:

In [None]:
# Make a figure with two subplots with a shared y-axis
fig, axs = plt.subplots(1,2, sharey=True)
# axs is an array of axes, so we can get the first subplot with axs[0]
sns.boxplot(x="Wing",y="Age",data=blackbirds, ax=axs[0])
# and the second with axs[1]
sns.boxplot(x="Weight",y="Age",data=blackbirds, ax=axs[1]);

## Distribution plots

`seaborn` has a `distplot` function that combines a histogram with an estimate of the continuous distribution shape:

In [None]:
# *** broken *** (Weight contains missing values; the next cell drops them first)
sns.distplot(blackbirds.Weight)

In [None]:
fig,axs = plt.subplots(1,2)
sns.distplot(blackbirds.Weight.dropna(),bins=10, ax=axs[0])
sns.distplot(blackbirds.Wing.dropna(),bins=10, ax=axs[1])

Use `distplot` to compare the distributions of weight and wing span for female and male blackbirds.

In [None]:
fig, axs = plt.subplots(1,2)
fig.suptitle("Weight and wingspan distribution by sex")

axs[0].get_yaxis().set_visible(False)
sns.distplot(blackbirds[blackbirds.Sex=='M'].Wing.dropna(),color="Green", ax=axs[0], label='M', bins=10)
sns.distplot(blackbirds[blackbirds.Sex=='F'].Wing.dropna(),color="Purple", ax=axs[0], label='F', bins=10)

axs[1].get_yaxis().set_visible(False)
sns.distplot(blackbirds[blackbirds.Sex=='M'].Weight.dropna(),color="Green", ax=axs[1], label='M')
sns.distplot(blackbirds[blackbirds.Sex=='F'].Weight.dropna(),color="Purple", ax=axs[1], label='F')

axs[1].legend();

What does this suggest?

## Hypothesis testing

It looks like the mean wing span for female blackbirds is different from the mean for males. How should we test that?

The `scipy` package has a function for doing t-tests:

https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ttest_ind.html

In [None]:
from scipy import stats

In [None]:
stats.ttest_ind(blackbirds.loc[blackbirds.Sex == 'M',"Wing"].dropna(),blackbirds.loc[blackbirds.Sex == 'F',"Wing"].dropna(),
               equal_var=False)

What can we conclude? Was this a one or a two-tailed test? Does it matter?
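
To explore the one-tailed versus two-tailed question, the result can be stored and its `statistic` and `pvalue` attributes inspected. The sketch below uses invented wing measurements and assumes SciPy 1.6 or later, where `ttest_ind` accepts an `alternative` argument:

```python
from scipy import stats

# Hypothetical wing lengths, not taken from blackbirds.csv
males = [131.0, 133.0, 134.0, 136.0, 138.0]
females = [126.0, 128.0, 129.0, 131.0, 132.0]

# Default: two-sided test of "the means differ" (Welch's version, equal_var=False)
two_sided = stats.ttest_ind(males, females, equal_var=False)
# One-sided test of "the male mean is greater than the female mean"
one_sided = stats.ttest_ind(males, females, equal_var=False, alternative="greater")

print(two_sided.statistic, two_sided.pvalue, one_sided.pvalue)
```

When the t statistic points in the direction of the one-sided alternative, the one-sided p-value is half the two-sided one, which is why the choice matters most for borderline results.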

## Time series

Let's look at how weight and wing span have varied over time:

In [None]:
# A groupby by itself doesn't do very much
blackbirds.groupby(by="Year")
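
The grouped object is lazy: it records how the rows are split but computes nothing until it is asked for a summary. A sketch on an invented frame shows a few things it can be asked:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Year": [2001, 2001, 2002, 2002, 2002],
    "Weight": [95.0, 105.0, 98.0, 101.0, 104.0],
})

grouped = toy.groupby(by="Year")        # no calculation has happened yet
n_groups = grouped.ngroups              # how many distinct years there are
counts = grouped.size()                 # number of rows in each year
yearly_mean = grouped["Weight"].mean()  # the aggregation triggers the work

print(n_groups, counts.to_dict(), yearly_mean.to_dict())
```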

In [None]:
blackbirds.groupby(by="Year").mean(numeric_only=True)

In [None]:
blackbirds.groupby(by="Year").mean(numeric_only=True).plot();

## Ordinal data

It so happened that A, F, J and U worked quite well because they're in alphabetical order. But it would be better to tell `pandas` what order we really mean them to come in.

In [None]:
blackbirds.Age = pd.Categorical(blackbirds.Age, categories=["U","J","F","A"])
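
`pd.Categorical` also takes an optional `ordered=True` flag, which makes sorting and comparisons follow the stated category order rather than the alphabet. A minimal sketch on a handful of invented age codes:

```python
import pandas as pd

# Hypothetical age codes, using the category order given above
ages = pd.Series(pd.Categorical(
    ["A", "J", "U", "F", "J"],
    categories=["U", "J", "F", "A"],
    ordered=True,
))

sorted_ages = ages.sort_values().tolist()   # follows the category order
older_than_juvenile = (ages > "J").tolist() # comparisons use it too

print(sorted_ages, older_than_juvenile)
```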

In [None]:
fig, axs = plt.subplots(1,2, sharey=True)
sns.boxplot(x="Wing",y="Age",data=blackbirds, ax=axs[0])
sns.boxplot(x="Weight",y="Age",data=blackbirds, ax=axs[1]);

Investigate the optional arguments for boxplots. What definition of outlier is used?
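
As a starting point, the `whis` argument of `boxplot` defaults to 1.5, meaning points further than 1.5 times the IQR beyond the quartiles are drawn individually. The sketch below applies that rule by hand to invented values; the plotting library computes its own quartiles, so its fences may differ slightly from these:

```python
import pandas as pd

weights = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])  # hypothetical values

q1, q3 = weights.quantile(0.25), weights.quantile(0.75)
iqr = q3 - q1
whis = 1.5  # the documented default for the whisker length

lower_fence = q1 - whis * iqr
upper_fence = q3 + whis * iqr
outliers = weights[(weights < lower_fence) | (weights > upper_fence)].tolist()

print(lower_fence, upper_fence, outliers)
```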