# Visualizing Data
We are now ready to begin looking at visualizations of our data. In this lecture, you will learn how to create scatterplots, distribution plots, bar charts, and line graphs.

In [None]:
# run this cell; don't worry about what it does yet.
from datascience import *
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('darkgrid')
%matplotlib inline

## Exploring TIMIT Data <a id='timit'></a>

Once again, our corpus for this week is [TIMIT](https://en.wikipedia.org/wiki/TIMIT), a database of read speech recorded from speakers of eight different varieties of American English. This database is specifically designed for phonetics and speech recognition research. For more information, visit [its catalog page](https://catalog.ldc.upenn.edu/LDC93S1) (Garofolo et al. 1993).

First, we will load the `.csv` file called `wk2-timit.csv`, which is located in the same folder as this notebook, and attach an IPA symbol to each vowel label. Then, as we did in the last lecture, we will split the table into two smaller tables, one for male speakers and one for female speakers.

In [None]:
timit = Table.read_table('wk2-timit.csv')
ipa = Table().with_columns(
'vowel', make_array('AA','AE','AH','AO','EH','ER','EY','IH','IY','OW','UH','UW'),
'vowel_IPA', make_array('a','æ','ʌ','ɔ','ɛ','ɚ','eɪ','ɪ','i','oʊ','ʊ','u'))
t = timit.join('vowel',ipa)
t_male = t.where('gender','male')
t_female = t.where('gender','female')

In [None]:
t_female.show(5)

## Scatterplots
For our visualizations, we are using functions from two libraries called `Seaborn` and `Matplotlib`. You can find out more about Seaborn [here](https://seaborn.pydata.org/index.html) and Matplotlib [here](https://matplotlib.org/). I will demonstrate the basic Table method for creating each visualization, followed by the more flexible way using Seaborn or Matplotlib. (When you ran the cell at the very beginning of this notebook, you loaded Seaborn and Matplotlib under the shortcuts `sns` and `plt`, respectively.)

Scatterplots are a visualization technique for continuous data. To make a scatterplot from a Table object, you can use the method `.scatter()` or the function `plt.scatter()`.

The method `.scatter()` is simpler, but less customizable. The first argument is the column of data that will go on the x-axis, and the second argument is the data for the y-axis. The axis labels will come from the column names. Try it below:

In [None]:
t_male.scatter('height','f0')

The `plt.scatter()` function is more complex, but also more flexible. First, here is the basic version of the same plot as above:

In [None]:
plt.scatter(x='height',y='f0',data=t_male)

Notice that when you use `plt.scatter()`, you have the option of naming the arguments `x` and `y` explicitly. This is not essential, but it helps you remember what each argument is, and it also helps other people reading your code!

However, the main difference between the `Matplotlib` function `plt.scatter()` and the basic method `.scatter()` is that in `plt.scatter()`, you must include an argument that tells the function where to find the `data`.

In addition, with `plt.scatter()`, you can change and add labels on successive lines in the cell:

In [None]:
plt.scatter(x='height',y='f0',data=t_male)
plt.xlabel('Height (cm)') #label for the x-axis
plt.ylabel('fundamental frequency (Hz)') # label for the y-axis
plt.title('f0 and height for TIMIT speakers (male)'); # title for the plot

Immediately you can see that in the data, there are many values of 0 for `f0`. These are likely to be values created by measurement error. (It would be unusual for an American English vowel to have a fundamental frequency, or pitch, of 0 Hertz.) How might you go about creating a subset of your tables such that the values of f0 that are equal to 0 are removed?
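
One possible approach to the question above is to keep only the rows where `f0` is greater than 0. The sketch below shows the idea with a boolean mask on plain NumPy arrays; the arrays `height` and `f0` hold made-up values for illustration and are not TIMIT measurements.

```python
import numpy as np

# Made-up values standing in for the 'height' and 'f0' columns (not TIMIT data).
height = np.array([170.0, 180.0, 175.0, 183.0, 168.0])
f0 = np.array([120.0, 0.0, 131.5, 0.0, 110.2])

# A boolean mask: True where f0 is a usable measurement, False where it is 0.
keep = f0 > 0

# Apply the same mask to both arrays so the rows stay aligned.
height_clean = height[keep]
f0_clean = f0[keep]

print(f0_clean)
print(height_clean)
```

In the notebook itself, the same filter could probably be written with the Table method used earlier, for example `t_male.where('f0', are.above(0))`. Here `are.above` is a predicate from the `datascience` library that is not otherwise used in this lecture, so treat that line as a suggestion to try rather than a tested recipe.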

## Distribution Plots

Next, we will look at distribution plots (or histograms). We want to inspect the distributions of F1 and F2 in `t_male` and `t_female` to identify possible trends or relationships. Having two separate tables, `t_female` and `t_male`, simplifies the visualization process, and with `Seaborn` we can overlay one plot on top of the other for a visual comparison.

Now, what are F1 and F2? The "F" stands for "formant", and the formants (first, second, third, etc.) are acoustic properties of vowels. The first formant (F1) and the second formant (F2) happen to correspond very closely to the phonological attributes of vowels that we explored last week: high, low, front, and back. To be specific, a higher value for F1 generally means that a vowel is low and not high, while a lower value for F1 means that a vowel is high and not low. A higher value for F2 generally means that a vowel is front and not back, while a lower value for F2 means that a vowel is back and not front. Vowels can also have relatively "middle-range" values for F1 and F2, making them neither high nor low, or neither front nor back. We'll see in more detail what this looks like later. But let's look at just one formant at a time for now.

Distribution plots can be made using the function `sns.distplot()`. The main argument of `sns.distplot()` is a one-dimensional array, which in this case is equal to one column in a `Table`. So, let's create arrays for the values in the column `F1` for female speakers and male speakers:

In [None]:
af = t_female.column('F1')
am = t_male.column('F1')

Each of these arrays can then be passed as the sole argument of `sns.distplot()`.

In [None]:
sns.distplot(af)

We add the argument `kde_kws` to include a label. Run this cell below, but don't worry too much about the structure of this argument.

In [None]:
sns.distplot(af, kde_kws={"label":"female"})

Adding lines below the initial line of code but *within the same cell* allows you to overlay other plots on your original plot and to add axis labels and a title.

In [None]:
sns.distplot(af, kde_kws={"label": "female"})
sns.distplot(am, kde_kws={"label": "male"})
plt.title('F1 of TIMIT speakers (male and female)')
plt.xlabel('Hz')
plt.ylabel('Proportion per Hz');

Describe the distributions of F1 for each group. Which group has, on average, higher values for F1?
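
A note on versions: `sns.distplot()` has been deprecated since Seaborn 0.11, so recent installations may print a warning when you run the cells above (newer Seaborn offers `sns.histplot()` as a replacement). As a rough alternative, the sketch below overlays two histograms with plain `Matplotlib`. The arrays `a1` and `a2` are randomly generated stand-ins with arbitrary parameters, not TIMIT data; in the notebook you would pass `af` and `am` instead.

```python
import numpy as np
import matplotlib.pyplot as plt

# Randomly generated stand-ins for two arrays of F1 values (arbitrary numbers,
# not TIMIT data). In the notebook, use `af` and `am` instead.
rng = np.random.default_rng(0)
a1 = rng.normal(loc=550, scale=100, size=500)
a2 = rng.normal(loc=600, scale=100, size=500)

# density=True rescales the bars so that each histogram has a total area of 1,
# and alpha makes the bars see-through so the overlap stays visible.
counts_a, edges_a, _ = plt.hist(a1, bins=30, density=True, alpha=0.5, label='A')
counts_b, edges_b, _ = plt.hist(a2, bins=30, density=True, alpha=0.5, label='B')

plt.legend()  # show the labels given above
plt.xlabel('Hz')
plt.ylabel('Proportion per Hz');
```

Unlike `sns.distplot()`, this version draws only the histogram bars and no smoothed curve.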

Not only do humans vary by gender in their F1 measurements, but each individual vowel also varies in its "standard" value for F1 (and F2). As discussed earlier, high vowels tend to have a lower F1, and low vowels tend to have a higher F1. (Isn't that confusing?) But we can illustrate this using distribution plots.

First, create subsets of the data for specific vowels.

In [None]:
iy = t_male.where('vowel','IY')
ae = t_male.where('vowel','AE')
aa = t_male.where('vowel','AA')

Next, create arrays for the F1 data in each table.

In [None]:
a_iy = iy.column('F1')
a_ae = ae.column('F1')
a_aa = aa.column('F1')

Finally, create distribution plots for each array and overlay them, adding labels and a title.

In [None]:
sns.distplot(a_iy, kde_kws={'label':'IY'})
sns.distplot(a_ae, kde_kws={'label':'AE'})
sns.distplot(a_aa, kde_kws={'label':'AA'})
plt.title('F1 of IY, AE, and AA of TIMIT speakers (male)')
plt.xlabel('Hz')
plt.ylabel('Proportion per Hz');

From this visualization, you can see how the IY vowel (as in "fleece") has a lower F1 than the other two vowels. Also, although AE (as in "trap") and AA (as in "palm") have quite a bit of overlap, it's easy to see that AA has a slightly higher average F1 value, which corresponds to it being a lower vowel. How might you calculate these averages?

## Vowel Plots
F1 and F2 are important characteristics of vowels, but it is admittedly difficult to understand them without a better visualization. The most common way to visualize vowels happens to be a type of scatterplot called a vowel plot. Using a few pre-made functions, we are now going to make a vowel plot using the TIMIT data!

We are going to be recreating the following graphic from [All Things Linguistic](http://allthingslinguistic.com/post/67308552090/how-to-remember-the-ipa-vowel-chart) (which, by the way, is an excellent resource for bite-sized linguistics lessons and links to other linguistics-learning websites).

![](notblank.png)

So, first, recall that the input for a scatterplot using `plt.scatter()` is a value for the x-axis and a value for the y-axis. In a vowel plot, F2 is plotted on the x-axis, and F1 is plotted on the y-axis. Each formant is also reversed on its axis and fitted to a logarithmic scale. (Don't worry too much about why or how to do this.)

What we need to do is create a table that has each vowel in one row, and one value of F1 and F2 for each vowel. How might we pick the F1/F2 value to use? We could use the maximum or minimum value in the column, but I think it also makes sense to use the mean. So, we're going to use the function `np.mean()` to create the data we need.

We have to do this separately for each vowel. (Yes, it is possible to do this faster and/or more efficiently with other Python tools, but those tools will not be taught now. You already know everything you need to know in order to do this!)

Step 1: Create an array that contains all of the F1 values of a single vowel. In this case, we will look at the vowel AA from the male TIMIT speakers using the method `.where()`, and then we will add another method, `.column()`, to select the column `F1`.

In [None]:
t_male.where('vowel','AA').column('F1')

Step 2: Use `np.mean()` to calculate the mean (average) of the values in the array.

In [None]:
np.mean(t_male.where('vowel','AA').column('F1'))

Step 3: Rinse and repeat for all of the vowels in `t_male`. (`np.unique(t_male.column('vowel'))` may help you list them!)

In [None]:
AA = np.mean(t_male.where('vowel','AA').column('F1'))
AE = np.mean(t_male.where('vowel','AE').column('F1'))
AH = np.mean(t_male.where('vowel','AH').column('F1'))
AO = np.mean(t_male.where('vowel','AO').column('F1'))
EH = np.mean(t_male.where('vowel','EH').column('F1'))
ER = np.mean(t_male.where('vowel','ER').column('F1'))
EY = np.mean(t_male.where('vowel','EY').column('F1'))
IH = np.mean(t_male.where('vowel','IH').column('F1'))
IY = np.mean(t_male.where('vowel','IY').column('F1'))
OW = np.mean(t_male.where('vowel','OW').column('F1'))
UH = np.mean(t_male.where('vowel','UH').column('F1'))
UW = np.mean(t_male.where('vowel','UW').column('F1'))

Step 4: Put all of your values together in an array, and then use the function `np.log()` to transform the data into the logarithmic scale.

In [None]:
mean_F1 = make_array(AA,AE,AH,AO,EH,ER,EY,IH,IY,OW,UH,UW)
mean_F1_log = np.log(mean_F1)
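
For the curious, Steps 1 through 4 can also be condensed into a short loop. The sketch below is one way to do it, shown on toy NumPy arrays; `vowel` and `f1` hold made-up values that stand in for the `vowel` and `F1` columns and are not TIMIT measurements.

```python
import numpy as np

# Made-up values standing in for the 'vowel' and 'F1' columns (not TIMIT data).
vowel = np.array(['AA', 'IY', 'AA', 'AE', 'IY', 'AE'])
f1 = np.array([700.0, 300.0, 740.0, 650.0, 340.0, 610.0])

# np.unique returns the distinct vowel labels in sorted order.
labels = np.unique(vowel)

# For each label, keep only the matching rows and take their mean.
mean_f1 = np.array([np.mean(f1[vowel == v]) for v in labels])

# Same log transformation as in Step 4.
mean_f1_log = np.log(mean_f1)

for v, m in zip(labels, mean_f1):
    print(v, m)
```

With the real data, the list comprehension would become `[np.mean(t_male.where('vowel', v).column('F1')) for v in labels]`, which reuses only the `.where()` and `.column()` methods from Steps 1 to 3.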

Step 5: Repeat for F2.

In [None]:
AA2 = np.mean(t_male.where('vowel','AA').column('F2'))
AE2 = np.mean(t_male.where('vowel','AE').column('F2'))
AH2 = np.mean(t_male.where('vowel','AH').column('F2'))
AO2 = np.mean(t_male.where('vowel','AO').column('F2'))
EH2 = np.mean(t_male.where('vowel','EH').column('F2'))
ER2 = np.mean(t_male.where('vowel','ER').column('F2'))
EY2 = np.mean(t_male.where('vowel','EY').column('F2'))
IH2 = np.mean(t_male.where('vowel','IH').column('F2'))
IY2 = np.mean(t_male.where('vowel','IY').column('F2'))
OW2 = np.mean(t_male.where('vowel','OW').column('F2'))
UH2 = np.mean(t_male.where('vowel','UH').column('F2'))
UW2 = np.mean(t_male.where('vowel','UW').column('F2'))
mean_F2 = make_array(AA2,AE2,AH2,AO2,EH2,ER2,EY2,IH2,IY2,OW2,UH2,UW2)
mean_F2_log = np.log(mean_F2)

Step 6: Combine the two arrays into a new Table called `mean_formants`. You'll notice that after log transformation, the F1 and F2 values are no longer in Hertz.

In [None]:
mean_formants = Table().with_columns('vowel',np.unique(timit.column('vowel')),
                                     'F1',mean_F1_log,
                                     'F2',mean_F2_log)
mean_formants
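
To get a feel for what the log transformation does, the small sketch below applies `np.log()` (the natural logarithm) to three made-up frequencies, each double the previous one. On the log scale, the doubling steps become equally spaced, and `np.exp()` converts the values back to Hertz. The numbers are illustrative only.

```python
import numpy as np

# Made-up frequencies in Hz, each double the previous one (not TIMIT data).
hz = np.array([250.0, 500.0, 1000.0])

# Natural logarithm of each value; these are no longer in Hertz.
hz_log = np.log(hz)

# Equal ratios in Hz become equal differences on the log scale.
steps = np.diff(hz_log)

# np.exp undoes np.log, recovering the original values.
hz_back = np.exp(hz_log)

print(hz_log)
print(steps)
print(hz_back)
```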

Step 7: Run the cell below. These are a nifty little pair of functions that will plot your vowels. You do not need to know how these functions work at this point in the course. But if you look closely at the code, do you notice a familiar function?

In [None]:
def plot_blank_vowel_chart():
    im = plt.imread('blankvowel.png')
    plt.imshow(im, extent=(plt.xlim()[0], plt.xlim()[1], plt.ylim()[0], plt.ylim()[1]))

def plot_vowel_space(avgs_df):
    plt.figure(figsize=(10, 8))
    plt.gca().invert_yaxis()
    plt.gca().invert_xaxis()
    
    vowels = ['a','æ','ʌ','ɔ','ɛ','ɚ','eɪ','ɪ','i','oʊ','ʊ','u']
    
    for i in range(avgs_df.num_rows):
        plt.scatter(avgs_df.column('F2')[i], avgs_df.column('F1')[i], marker=r"$ {} $".format(vowels[i]), s=1000)
        
    plt.ylabel('F1')
    plt.xlabel('F2')

Step 8: Finally, run the function with its sole argument: your table `mean_formants`! Because the function is built on `plt.scatter()`, the result is really just another `Matplotlib` plot, so you can add extra lines to change the axis labels and title.

In [None]:
plot_vowel_space(mean_formants)
plt.xlabel('log(F2)')
plt.ylabel('log(F1)')
plt.title('Mean vowel formants for TIMIT speakers (male)');

(Optional Step 9:) We can also overlay a blank vowel chart outline to see how closely our data match the theoretical vowel chart.

In [None]:
plot_vowel_space(mean_formants)
plot_blank_vowel_chart()
plt.xlabel('log(F2)')
plt.ylabel('log(F1)')
plt.title('Mean vowel formants for TIMIT speakers (male)');

How well does it match the original?

## Line graph
Line graphs are another way to represent the relationship between two numerical variables. This often makes the most sense when the data is a time series, or something that changes across time. Our non-linguistic example uses census data for the city of Berkeley.

In [None]:
b = Table().with_columns('Year',make_array(1910,1920,1930,1940,1950,1960,1970,1980,1990,2000,2010),
                         'Population',make_array(40434,56036,82109,85547,113805,111268,114091,103328,102724,102743,112580))
b

To make a line graph from a Table object, use the method `.plot()` or the `Matplotlib` equivalent function `plt.plot()`.

In [None]:
b.plot('Year','Population')

In [None]:
# Or, using Matplotlib:
plt.plot('Year','Population',data=b)
plt.title('Population growth of Berkeley, CA')
plt.xlabel('Year')
plt.ylabel('Population')
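
The rises and dips in the line graph can also be put into numbers. As a small sketch, `np.diff()` computes the change from one census to the next, using the same `Year` and `Population` values as the table `b` above (copied here into plain arrays so the cell stands alone).

```python
import numpy as np

# Same values as the table `b`, copied into plain NumPy arrays.
year = np.array([1910, 1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000, 2010])
population = np.array([40434, 56036, 82109, 85547, 113805, 111268,
                       114091, 103328, 102724, 102743, 112580])

# Change in population between consecutive censuses (one value per decade).
change = np.diff(population)

# Each change belongs to the decade ending in the later of the two years.
decade_end = year[1:]

for y, c in zip(decade_end, change):
    print(y, c)

# Census year that closes the decade with the largest increase.
biggest = decade_end[np.argmax(change)]
print(biggest)
```

Positive values correspond to upward segments of the line and negative values to downward ones.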

## Bar charts
Finally, bar charts are useful for visualizing counts of data when one of the variables is categorical (e.g., a column of strings such as names).

In [None]:
c = Table().with_columns('School',make_array("Harvard","Cambridge","UC Berkeley","Chicago","MIT","Columbia","Stanford","CalTech","Oxford","Princeton"),
                         'Number of Nobel Laureates',make_array(160, 120, 107, 100, 97, 96, 83, 74, 72, 68))
c.sort('School')

To make a bar chart from a Table object, use the method `.bar()` or `.barh()`, or the `Matplotlib` equivalent functions `plt.bar()` and `plt.barh()`.

In [None]:
c.bar('School', 'Number of Nobel Laureates')

In [None]:
c.barh('School', 'Number of Nobel Laureates') # The 'h' stands for 'horizontal'

In [None]:
# Or, using Matplotlib:
plt.barh('School','Number of Nobel Laureates',data=c)
plt.title('Which institution is the most pretentious?')
plt.xlabel('Number of Nobel Laureates (1901-2019)')
plt.ylabel('Institution')
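
The order of the bars follows the order of the rows, so it can help to sort the data before plotting. The sketch below uses `np.argsort()` to reorder the same schools and counts as in the table `c` (copied into plain arrays so the cell stands alone). Sorting in ascending order is a choice made here because `plt.barh()` draws the first row at the bottom, which puts the longest bar at the top.

```python
import numpy as np

# Same values as the table `c`, copied into plain NumPy arrays.
school = np.array(["Harvard", "Cambridge", "UC Berkeley", "Chicago", "MIT",
                   "Columbia", "Stanford", "CalTech", "Oxford", "Princeton"])
laureates = np.array([160, 120, 107, 100, 97, 96, 83, 74, 72, 68])

# argsort returns the row positions that would sort the counts in ascending order.
order = np.argsort(laureates)

# Reorder both arrays with the same positions so names and counts stay paired.
school_sorted = school[order]
laureates_sorted = laureates[order]

for s, n in zip(school_sorted, laureates_sorted):
    print(s, n)
```

With the Table itself, `c.sort('Number of Nobel Laureates')` should give the same ascending order, using the `.sort()` method shown earlier.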