In this notebook, we're going to learn about grouping and pivoting. These are two core concepts in data science and will help us dive deeper into linguistic data. This week, as you know, we're working with the TIMIT dataset.

In [None]:
# run this cell; don't worry about what it does yet.
from datascience import *
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('darkgrid')
%matplotlib inline

## Grouping

First, let's read in the TIMIT dataset,

In [None]:
timit = Table.read_table('wk2-timit.csv')

and have a look at the first few rows.

In [None]:
timit.show(5)

What do you see? I see a Table with 11 columns. At first glance, it looks like each row in the Table represents one utterance of a particular vowel. That's a lot of data. Grouping helps us summarize the data by collapsing certain columns. For example, if we group by the `word` column, we can see how often each word appears in the whole dataset.

In [None]:
timit.group(["word"])
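To make concrete what grouping by one column computes, here is a minimal plain-Python sketch using `collections.Counter`. The rows are invented for illustration and are not taken from TIMIT. Note that `Table.group` gives back a Table with a `count` column, whereas this sketch produces a dictionary-like object.

```python
from collections import Counter

# Hypothetical rows for illustration only (not real TIMIT data).
rows = [
    {"word": "she", "vowel": "IY"},
    {"word": "had", "vowel": "AE"},
    {"word": "she", "vowel": "IY"},
    {"word": "dark", "vowel": "AA"},
]

# Grouping by one column amounts to counting how often each value occurs.
word_counts = Counter(row["word"] for row in rows)

# Print one line per distinct word, in alphabetical order.
for word, count in sorted(word_counts.items()):
    print(word, count)
```

Every row lands in exactly one group, so the counts always add up to the number of rows in the original data.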

Figuring out the number of times a certain value appears is quite a common task in data science. The general approach we'll take using the `datascience` library is to group by the column we're interested in:

In [None]:
timit.group(["gender"])

We can also group by multiple columns at once. The following code shows us how many speakers from each gender appear in each region. For example, there are 921 female speakers from region 1:

In [None]:
timit.group(["gender", "region"])
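Grouping by two columns can be pictured as counting each distinct pair of values. The sketch below does this in plain Python. The rows are made up for illustration, and the variable names are ours rather than anything from the `datascience` library.

```python
from collections import Counter

# Hypothetical rows for illustration only (not real TIMIT data).
rows = [
    {"gender": "female", "region": 1},
    {"gender": "female", "region": 1},
    {"gender": "male", "region": 1},
    {"gender": "male", "region": 2},
]

# Grouping by two columns counts each distinct (gender, region) pair.
pair_counts = Counter((row["gender"], row["region"]) for row in rows)

# Print one line per combination that actually occurs in the data.
for (gender, region), count in sorted(pair_counts.items()):
    print(gender, region, count)
```

Only combinations that occur in the data get a line of output, which is part of why the long format is harder to scan than a grid.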

This isn't the best way to visualize this, but we'll see a better way below when we learn about pivoting.

Try figuring out how many times each vowel appears in this dataset. Save your result in a variable named `vowels`, since the next code cell uses that name.

In [None]:
##

That was easy! We can now visualize this using `.barh()`, which draws a horizontal bar chart. Sorting by `count` beforehand makes a nice graph.

In [None]:
vowels.sort('count', descending=True).barh('vowel', 'count')

We can also chain our commands to create more complex combinations. For example, here's how we can produce the same graph as above but just for the female speakers:

In [None]:
timit.where("gender", "female").group(["vowel"]).sort('count', descending=True).barh('vowel', 'count')
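If the chain above feels dense, it may help to see the same three steps (filter, count, sort) written out one at a time. This is a plain-Python sketch on invented rows, not the `datascience` implementation, and it stops before the plotting step.

```python
from collections import Counter

# Hypothetical rows for illustration only (not real TIMIT data).
rows = [
    {"gender": "female", "vowel": "IY"},
    {"gender": "female", "vowel": "AE"},
    {"gender": "male", "vowel": "IY"},
    {"gender": "female", "vowel": "IY"},
]

# Step 1 (like .where): keep only the rows for female speakers.
female_rows = [row for row in rows if row["gender"] == "female"]

# Step 2 (like .group): count how often each vowel occurs.
vowel_counts = Counter(row["vowel"] for row in female_rows)

# Step 3 (like .sort with descending=True): order by count, largest first.
ranked = vowel_counts.most_common()
print(ranked)
```

Each step takes the output of the previous one as its input, which is exactly what chaining method calls does in a single line.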

## Pivoting

Earlier, we saw how to count the number of speakers in each gender and region combination. We used `Table.group()` to do this, but it gave us a result that wasn't that easy to interpret. Pivoting gives us another way to visualize this. We use `Table.pivot()` for this:

In [None]:
timit.pivot("gender", "region")
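A pivot can be thought of as the two-column grouping rearranged into a grid. In `Table.pivot`, the values of the first column become the column labels and the values of the second become the rows. The sketch below builds such a grid in plain Python from invented rows. The nested-dictionary layout is our own choice for illustration, not how the library stores its result.

```python
from collections import Counter

# Hypothetical rows for illustration only (not real TIMIT data).
rows = [
    {"gender": "female", "region": 1},
    {"gender": "female", "region": 1},
    {"gender": "male", "region": 1},
    {"gender": "male", "region": 2},
]

# Same pair counts as grouping by both columns.
pair_counts = Counter((row["gender"], row["region"]) for row in rows)

genders = sorted({gender for gender, _ in pair_counts})
regions = sorted({region for _, region in pair_counts})

# One grid row per region and one grid column per gender.
# A Counter returns 0 for missing keys, so absent combinations become 0.
pivoted = {
    region: {gender: pair_counts[(gender, region)] for gender in genders}
    for region in regions
}

# Print a header line, then one line per region.
print("region", *genders)
for region in regions:
    print(region, *(pivoted[region][gender] for gender in genders))
```

Unlike the grouped output, the grid has a cell for every combination, including those with a count of zero.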

Now try pivoting the Table on the `gender` and `vowel` columns. What does each cell in the result represent?

In [None]:
##