# Linguistics 110: Closure and Voice-Onset Time

### Professor Susan Lin

This notebook will familiarize you with some of the basic strategies for data analysis that can be useful not only in this course, but possibly for the rest of your time at Cal. We will cover an overview of our computing environment, and then will explore the data on closure and VOT that you submit.

## Table of Contents

1 - [Computing Environment](#computing environment)

2 - [Creating our Dataframe](#dataframe)

3 - [Exploring the Data](#exploring data)

4 - [Relationships between Closures](#closures)

5 - [Exploring Metadata](#metadata)

6 - [Comparing to Others](#to class)

## 1. Our Computing Environment, Jupyter notebooks  <a id='computing environment'></a>
This webpage is called a Jupyter notebook. A notebook is a place to write programs and view their results. 

### Text cells
In a notebook, each rectangle containing text or code is called a *cell*.

Text cells (like this one) can be edited by double-clicking on them. They're written in a simple format called [Markdown](http://daringfireball.net/projects/markdown/syntax) to add formatting and section headings.  You don't need to learn Markdown, but you might want to.

After you edit a text cell, click the "run cell" button at the top that looks like ▶| to confirm any changes. (Try not to delete the instructions of the lab.)

**Understanding Check 1** This paragraph is in its own text cell.  Try editing it so that this sentence is the last sentence in the paragraph, and then click the "run cell" ▶| button .  This sentence, for example, should be deleted.  So should this one.

### Code cells
Other cells contain code in the Python 3 language. Running a code cell will execute all of the code it contains.

To run the code in a code cell, first click on that cell to activate it.  It'll be highlighted with a little green or blue rectangle.  Next, either press ▶| or hold down the `shift` key and press `return` or `enter`.

Try running this cell:

In [None]:
print("Hello, World!")

The fundamental building block of Python code is an expression. Cells can contain multiple lines with multiple expressions. When you run a cell, the lines of code are executed in the order in which they appear. Every `print` expression prints a line. Run the next cell and notice the order of the output.

In [None]:
print("First this line is printed,")
print("and then this one.")

### Writing Jupyter notebooks
You can use Jupyter notebooks for your own projects or documents.  When you make your own notebook, you'll need to create your own cells for text and code.

To add a cell, click the + button in the menu bar.  It'll start out as a text cell.  You can change it to a code cell by clicking inside it so it's highlighted, clicking the drop-down box next to the restart (⟳) button in the menu bar, and choosing "Code".

### Errors
Python is a language, and like natural human languages, it has rules.  It differs from natural language in two important ways:
1. The rules are *simple*.  You can learn most of them in a few weeks and gain reasonable proficiency with the language in a semester.
2. The rules are *rigid*.  If you're proficient in a natural language, you can understand a non-proficient speaker, glossing over small mistakes.  A computer running Python code is not smart enough to do that.

Whenever you write code, you'll make mistakes.  When you run a code cell that has errors, Python will sometimes produce error messages to tell you what you did wrong.

Errors are okay; even experienced programmers make many errors.  When you make an error, you just have to find the source of the problem, fix it, and move on.

We have made an error in the next cell.  Run it and see what happens.

In [None]:
print("This line is missing something."

The last line of the error output attempts to tell you what went wrong.  The *syntax* of a language is its structure, and this `SyntaxError` tells you that you have created an illegal structure.  "`EOF`" means "end of file," so the message is saying Python expected you to write something more (in this case, a right parenthesis) before finishing the cell.

There's a lot of terminology in programming languages, but you don't need to know it all in order to program effectively. If you see a cryptic message like this, you can often get by without deciphering it. Add a parenthesis to the end of the line to get rid of the error message.

Run the cell below so that we can get started on our module! These are our import statements (and a few other things). 

Because of the size of the Python community, if there is a function that you want to use, there is a good chance that someone has written one already and been kind enough to share their work in the form of packages. We can start using those packages by writing `import` and then the package name.

TODO: WRITE ABOUT THE TOOLBAR, SERVER TIME OUT, ETC.

In [None]:
# imports -- just run this cell
import scipy
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import mode
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings('ignore')
sns.set_style('darkgrid')
%matplotlib inline

## 2. Creating our Dataframe <a id='dataframe'></a>
We will start by familiarizing ourselves with the data.

To visualize the data, we first need to load the file. In the cell below, we assign `file_name` the name of our dataset, which is a compilation of the results from the homework you completed last week.

Note that we have `data/` in front of the file name, which means that our file `example_data.csv` is in the `data` directory (folder).

In [None]:
file_name = 'data/example_data.csv'
data = pd.read_csv(file_name)
data.head()
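
If you are curious what `pd.read_csv` and `head` do with a file, here is a minimal sketch on a tiny made-up table. The column names and values are hypothetical and are not taken from `example_data.csv`; the text is read from memory with `io.StringIO` rather than from a file on disk.

```python
import io
import pandas as pd

# Hypothetical miniature of a class file: made-up columns and values.
csv_text = """pclo,tclo,kclo
90.0,80.0,70.0
100.0,95.0,85.0
"""

# pd.read_csv accepts a file path or a file-like object such as StringIO.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column names come from the first line of the text
print(toy.head())         # up to the first five rows
```

The first line of a CSV file supplies the column names, and every later line becomes one row of the dataframe.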

### 2.1 Adding features from our data

We are going to add several columns to our dataframe, one for each of the following:
+ The semester of this class (called `class`)
+ Average of all closure/vot for each individual (called `clo`/`vot`)
+ Average voiced closure/vot for each individual (called `vclo`/`vvot`)
+ Average voiceless closure/vot for each individual (called `vlclo`/`vlvot`)

First we are just going to add the column `class`. We will set it to be equal to `Fall 2017` for each row.

In [None]:
data['class'] = 'Fall 2017'
data.head()

Next we will add the column for the average of all of the closures for each row. First we will pull out just the columns that we want to take the average of.

In [None]:
subset = data[['pclo', 'tclo', 'kclo', 'bclo', 'dclo', 'gclo']]
subset.head()

Then we will take the average across those rows.

In [None]:
clo_avg = subset.mean(axis=1)
clo_avg

And finally, we will append those values to our dataframe as a column called `clo`.

In [None]:
data['clo'] = clo_avg
data.head()
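
The `axis=1` argument is what makes `mean` work across the columns of each row. A small sketch on made-up numbers (the values are hypothetical) shows the difference between `axis=1` and `axis=0`:

```python
import pandas as pd

# Two hypothetical speakers with three closure measurements each (ms).
toy = pd.DataFrame({
    'pclo': [90.0, 100.0],
    'tclo': [80.0, 95.0],
    'kclo': [70.0, 85.0],
})

row_means = toy.mean(axis=1)  # one average per row (per speaker)
col_means = toy.mean(axis=0)  # one average per column (per measurement)

toy['clo'] = row_means        # store the per-speaker average as a new column
print(toy)
print(col_means)
```

With `axis=1` we get one number per person, which is what we want for the `clo` column.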

We then repeat this process for all of the other columns that we want to create.

In [None]:
data['vot'] = data[['pvot', 'tvot', 'kvot', 'bvot', 'dvot', 'gvot']].mean(axis=1)
# /b/, /d/, /g/ are the voiced stops; /p/, /t/, /k/ are the voiceless stops
data['vclo'] = data[['bclo', 'dclo', 'gclo']].mean(axis=1)
data['vvot'] = data[['bvot', 'dvot', 'gvot']].mean(axis=1)
data['vlclo'] = data[['pclo', 'tclo', 'kclo']].mean(axis=1)
data['vlvot'] = data[['pvot', 'tvot', 'kvot']].mean(axis=1)
data.head()

## 3. Exploring the Data <a id='exploring data'></a>

### 3.1 Descriptive Statistics
Below we compute some basic properties of the column `clo`.

In [None]:
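# np.atleast_1d keeps this working whether scipy returns the mode as an array or as a single number
closure_mode = np.atleast_1d(mode(data['clo']).mode)[0]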
print('Mode: ', closure_mode)

data.clo.describe()
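
To see what these summary numbers mean, here is a small sketch on five made-up values (they are hypothetical, not class data). It uses pandas' own `mean`, `median`, and `mode` methods, which is one of several ways to compute them.

```python
import pandas as pd

# Five hypothetical closure durations in ms.
values = pd.Series([60.0, 70.0, 70.0, 80.0, 120.0])

mean_value = values.mean()            # sum divided by count
median_value = values.median()        # middle value once sorted
mode_value = values.mode().iloc[0]    # most frequent value (first one if tied)
value_range = (values.min(), values.max())

print(mean_value, median_value, mode_value, value_range)
```

Notice that the single large value pulls the mean above the median, which is a common sign of a skewed distribution.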

### 3.2 Data Visualization
Now that we have our data in order, let's get a picture of the data with some plots.

Let's start by visualizing the distribution of `vot` with a histogram.

In [None]:
sns.distplot(data['vot'])

In [None]:
sns.distplot(data['vvot'])
sns.distplot(data['vlvot'])
plt.xlabel('ms')

In [None]:
# sorted(...)[:-1] in the third line drops the largest kvot value, an outlier, before plotting
sns.distplot(data['pvot'])
sns.distplot(data['tvot'])
sns.distplot(sorted(list(data['kvot']))[:-1])

In [None]:
sns.distplot(data['bvot'])
sns.distplot(data['dvot'])
sns.distplot(sorted(list(data['gvot']))[:-1])
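
The `sorted(...)[:-1]` pattern in the cells above sorts the values and then leaves out the last (largest) one. Here is a minimal sketch with made-up numbers; the values are hypothetical.

```python
# Hypothetical VOT values in ms, with one very large value.
kvot = [75.0, 60.0, 310.0, 82.0]

# sorted() returns a new list in ascending order; [:-1] keeps everything but the last item.
trimmed = sorted(kvot)[:-1]

print(trimmed)
```

This removes exactly one value, whatever its size, so it is only a quick way to keep a single extreme measurement from dominating a plot.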

#### COUNT: 
Below, we will see the number of people who speak each language.

In [None]:
sns.countplot(y="language", data=data)

Below, we have the distribution of height.

In [None]:
sns.distplot(data['height'])

plt.xlabel('height (cm)')

## 4. Relationships between closures <a id='closures'></a>

TODO: add markdown and explanations to this whole section

### 4.1 Using x=y line

In [None]:
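def plot_with_equality_line(xs, ys, best_fit=False):
    """Scatter plot of xs against ys with a dashed x = y reference line."""
    fig, ax = plt.subplots()
    # Pass x and y by keyword; recent seaborn versions reject positional data arguments
    sns.regplot(x=xs, y=ys, fit_reg=best_fit, ax=ax)

    # Use the same limits on both axes so the x = y line runs from corner to corner
    lims = [np.min([ax.get_xlim(), ax.get_ylim()]), np.max([ax.get_xlim(), ax.get_ylim()])]
    ax.plot(lims, lims, '--', alpha=0.75, zorder=0, c='black')
    ax.set_xlim(lims)
    ax.set_ylim(lims)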

In [None]:
plot_with_equality_line(data['tclo'], data['pclo'])

plt.xlabel('tclo (ms)')
plt.ylabel('pclo (ms)')

In [None]:
plot_with_equality_line(data['kclo'], data['pclo'])

plt.xlabel('kclo (ms)')
plt.ylabel('pclo (ms)')

In [None]:
plot_with_equality_line(data['kclo'], data['tclo'])

plt.xlabel('kclo (ms)')
plt.ylabel('tclo (ms)')

In [None]:
plot_with_equality_line(data['bclo'], data['dclo'])

plt.xlabel('bclo (ms)')
plt.ylabel('dclo (ms)')

In [None]:
plot_with_equality_line(data['bclo'], data['gclo'])

plt.xlabel('bclo (ms)')
plt.ylabel('gclo (ms)')

In [None]:
plot_with_equality_line(data['dclo'], data['gclo'])

plt.xlabel('dclo (ms)')
plt.ylabel('gclo (ms)')
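
One way to read a plot with an x = y line is to ask how many points fall above it. A point above the line has a larger y value than x value. The sketch below counts this for made-up numbers; the values are hypothetical and the variable names simply mirror the columns plotted above.

```python
import numpy as np

# Hypothetical closure durations (ms) for four speakers.
tclo = np.array([80.0, 95.0, 70.0, 110.0])   # plotted on the x-axis
pclo = np.array([90.0, 90.0, 85.0, 100.0])   # plotted on the y-axis

above_line = pclo > tclo          # True where the point sits above x = y
share_above = above_line.mean()   # fraction of True values

print(above_line, share_above)
```

If most points sit on one side of the line, one measurement tends to be longer than the other; if they are split evenly, neither is systematically longer.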

### 4.2 Using box-and-whisker plots

In [None]:
sns.boxplot(data=data[['pclo', 'tclo', 'kclo']], width=.3, palette="Set3")

plt.ylabel('duration (ms)')
plt.xlabel('Voiceless Closures')

In [None]:
sns.boxplot(data=data[['bclo', 'dclo', 'gclo']], width=.3, palette="hls")

plt.ylabel('duration (ms)')
plt.xlabel('Voiced Closures')
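
The box in a box-and-whisker plot spans the first to the third quartile, with a line at the median. As a rough sketch, the same numbers can be computed with pandas' `quantile` on made-up values (the durations below are hypothetical):

```python
import pandas as pd

# Five hypothetical closure durations in ms.
durations = pd.Series([50.0, 60.0, 70.0, 80.0, 90.0])

# Quartiles: the values below which 25%, 50%, and 75% of the data fall.
q1, median, q3 = durations.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1   # interquartile range, the height of the box

print(q1, median, q3, iqr)
```

A tall box means the middle half of the measurements is spread out; a short box means they are tightly clustered.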

## 5. Explore relationships to metadata <a id='metadata'></a>
Now let's explore relationships between closure and different characteristics of the people who provided the measurements, looking at language and height. We'll draw plots to see whether there are any relationships between them.

### 5.1 Language
Before we look at the actual relationship, it is important to realize any potential limitations of our observations. If you look back up to the bar plot of different native languages, you will see that the majority speak English. 
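
The bar plot is a picture of a simple count. A minimal sketch of that count, using pandas' `value_counts` on a made-up `language` column (the rows below are hypothetical):

```python
import pandas as pd

# Hypothetical speakers; only the language column matters here.
toy = pd.DataFrame({
    'language': ['English', 'English', 'English', 'Tagalog', 'Dutch'],
})

counts = toy['language'].value_counts()   # number of speakers per language
shares = counts / len(toy)                # proportion of the class per language

print(counts)
print(shares)
```

Keeping the group sizes in mind helps when judging how much weight a pattern in a small group can bear.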

Question: If we try to draw conclusions about people who speak Tagalog or Dutch, would those conclusions be reliable? Why or why not? (Type your answer below.)

**Answer**:

Here, each violin shows the distribution of closure measurements for the speakers of one language.

In [None]:
sns.violinplot(x="clo", y="language", data=data)

plt.xlabel('clo (ms)')

Compare the distributions. Can you make any meaningful observations?

**Answer**:

### 5.2 Height

Now we'll look at how height influences closure.

In [None]:
sns.lmplot(x='clo', y='height', data=data, fit_reg=False)

plt.xlabel('clo (ms)')
plt.ylabel('height (cm)')

In the scatter plot above, each dot represents an individual and their corresponding average closure and height.

Change `fit_reg` to `True` in the code above to see the regression line.

A regression line, sometimes referred to as the 'line of best fit', describes the general trend of the data. What does this graph tell you about the relationship between height and closure? Type your answer below.

**Answer**:
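
If you would like to see the numbers behind a line of best fit, here is a small sketch using `np.polyfit`. This is a stand-in chosen for illustration (seaborn does its own fitting internally), and the data points are made up so that they fall exactly on a line.

```python
import numpy as np

# Hypothetical averages for four speakers.
clo = np.array([60.0, 70.0, 80.0, 90.0])         # closure in ms (x-axis)
height = np.array([160.0, 165.0, 170.0, 175.0])  # height in cm (y-axis)

# Fit a straight line height = slope * clo + intercept (degree-1 polynomial).
slope, intercept = np.polyfit(clo, height, deg=1)

predicted = slope * clo + intercept   # heights the line predicts for each speaker
print(slope, intercept)
```

A slope near zero would mean that height barely changes as closure changes; real data will scatter around the line rather than sit on it.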

Let's see if the relationship with height differs between voiced and voiceless closures.

In [None]:
sns.regplot(x='vclo', y='height', data=data, fit_reg=True)
sns.regplot(x='vlclo', y='height', data=data, fit_reg=True)

plt.xlabel('clo (ms)')
plt.ylabel('height (cm)')

### 5.3 Visualizing Multiple Features

So far, we've been presenting two kinds of information in one plot (e.g. language vs. closure). Would presenting more than two at once help our analysis? Let's try it.

**Answer**:

Below, the color of each dot depends on the language that person speaks.

In [None]:
sns.lmplot(x='clo', y='height', data=data, fit_reg=False, hue="language")

plt.xlabel('clo (ms)')
plt.ylabel('height (cm)')

What conclusions can you make from the graph above, if any? Is it easy to analyze this plot? Why? Type your answer below.

**Answer**:

The lesson here is that sometimes less is more.

## 6. Compare data of entire class <a id='to class'></a>

### 6.1 Compare yourself to the rest of the class

Let's see where your own data stands in relation to the rest of the class. Change `myRowNumber` to the number of the row in the table that contains your data and run the cell.

In [None]:
myRowNumber = 1 #CHANGE TO ELLIPSIS IN ORIGINAL NOTEBOOK

In [None]:
xAxis = data['clo'].tolist()
yAxis = data['height'].tolist()

myClosure = xAxis.pop(myRowNumber)
myHeight = yAxis.pop(myRowNumber)
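
The `pop` method removes the item at a given position from a list and hands it back. A minimal sketch with made-up values (the numbers and names here are hypothetical):

```python
# Hypothetical closure averages for three speakers.
closures = [80.0, 93.0, 75.0]
my_row_number = 1

# pop removes the item at that position and returns it.
my_closure = closures.pop(my_row_number)

print(my_closure)   # the value that was at position 1
print(closures)     # the list no longer contains that value
```

Python counts positions from 0, so row number 1 is the second item. Note also that `pop` only changes the list, not the dataframe the list was made from.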

In [None]:
sns.distplot(data['clo'])
plt.axvline(myClosure, color='b', linestyle='dashed', linewidth=1)
plt.xlabel('clo (ms)')

In [None]:
sns.distplot(data['height'])
plt.axvline(myHeight, color='b', linestyle='dashed', linewidth=1)
plt.xlabel('height (cm)')

In [None]:
# fill=True shades the contours; the lowest density level is left unshaded by default
sns.kdeplot(x=data['clo'], y=data['height'], cmap="Blues", fill=True)
plt.axvline(myClosure, color='b', linestyle='dashed', linewidth=1)
plt.axhline(myHeight, color='b', linestyle='dashed', linewidth=1)

plt.xlabel('clo (ms)')
plt.ylabel('height (cm)')

### 6.2 Compare our data with data from last semester

It's often useful to compare current data with past data. Below, we'll explore class data collected from last semester.

In [None]:
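spring2017_file = 'data/vots.csv'
sp17 = pd.read_csv(spring2017_file)
sp17['Class'] = 'Spring 2017'

# Stack both semesters; rename our lowercase `class` column so the labels share one column
sp17data = pd.concat([sp17, data.rename(columns={'class': 'Class'})])
sp17data.head()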
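
Stacking two tables on top of each other works best when each one carries a label saying where its rows came from. Here is a minimal sketch with `pd.concat` on two made-up tables; the values are hypothetical.

```python
import pandas as pd

# Two hypothetical semesters with one measurement column each.
spring = pd.DataFrame({'closure': [80.0, 90.0]})
fall = pd.DataFrame({'closure': [85.0]})

# Label every row with its semester before combining.
spring['Class'] = 'Spring 2017'
fall['Class'] = 'Fall 2017'

# Stack the rows; ignore_index=True renumbers them 0, 1, 2, ...
combined = pd.concat([spring, fall], ignore_index=True)
print(combined)
```

The label column must have exactly the same name in both tables. Otherwise the combined table ends up with two separate, half-empty columns.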

As before, we'll calculate the mean, mode, median, and range of last semester's closure and compare to ours.

In [None]:
# np.atleast_1d keeps the mode lookup working whether scipy returns an array or a single number
data_closure_mean = np.mean(data.closure)
data_closure_mode = np.atleast_1d(scipy.stats.mode(data.closure).mode)[0]
data_closure_median = np.median(data.closure)
data_closure_range = [min(data.closure), max(data.closure)]

sp17_closure_mean = np.mean(sp17.closure)
sp17_closure_mode = np.atleast_1d(scipy.stats.mode(sp17.closure).mode)[0]
sp17_closure_median = np.median(sp17.closure)
sp17_closure_range = [min(sp17.closure), max(sp17.closure)]

df = pd.DataFrame()
df['Closure'] = ['mean', 'mode', 'median', 'range']
df['Spring 2017'] = [sp17_closure_mean, sp17_closure_mode, sp17_closure_median, sp17_closure_range]
df['Fall 2017'] = [data_closure_mean, data_closure_mode, data_closure_median, data_closure_range]
df

Let's check the closure mean by **gender** of last semester's class and compare to our class.

In [None]:
fig, ax = plt.subplots(1,2)
fig.set_size_inches(10, 8)
sns.barplot(x="gender", y="closure", data=sp17, ax=ax[0])
sns.barplot(x="gender", y="closure", data=data, ax=ax[1])
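
The height of each bar in a bar plot like this is the mean of one group. As a sketch, the same group means can be computed with pandas' `groupby`; the table and its category labels below are made up for illustration.

```python
import pandas as pd

# Hypothetical rows: a category column and a closure measurement (ms).
toy = pd.DataFrame({
    'gender': ['F', 'F', 'M', 'M'],
    'closure': [80.0, 90.0, 70.0, 90.0],
})

# Split the rows by category, then average the closure column within each group.
group_means = toy.groupby('gender')['closure'].mean()
print(group_means)
```

Looking at the numbers alongside the plot makes it easier to tell a real difference from one that is small compared with the spread inside each group.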

In [None]:
fig, ax = plt.subplots(1,2)
fig.set_size_inches(10, 4)
sns.stripplot(x="closure", y="gender", data=sp17[sp17['gender'] != "Other / prefer not to answer"], jitter=True, ax=ax[0]).set_ylabel('')
sns.stripplot(x="closure", y="gender", data=data[data['gender'] != "Other / prefer not to answer"], jitter=True, ax=ax[1]).set_ylabel('')

Now let's look at their language and closure and compare to ours.

In [None]:
fig, ax = plt.subplots(1,2)
fig.set_size_inches(10, 4)
sns.violinplot(x="closure", y="language", data=data, ax=ax[0])
sns.violinplot(x="closure", y="language", data=sp17, ax=ax[1]).set_ylabel('')

Very interesting. How about the **height**?

In [None]:
sns.lmplot(x='closure', y='height', data=sp17data, fit_reg=False, hue='Class')

Lastly, let's look at the relationship between **height, gender, and closure** and compare.

In [None]:
sns.lmplot(x="closure", y="height",hue='gender', col="Class", data=sp17data[sp17data['gender'] != 'Other / prefer not to answer'])

Overall, how does our class data compare with last semester's? Type your answer below.

**Answer**:

(Optional) Here's one plot with pclo, tclo, and kclo all plotted against height.

In [None]:
# Stack the three closure columns into one long column, repeating height for each
height_3x = pd.concat([data.height] * 3, ignore_index=True)
ptk_closure = pd.concat([data.pclo, data.tclo, data.kclo], ignore_index=True)
n = len(data)  # number of speakers, instead of a hard-coded count
closure_type = ['pclo'] * n + ['tclo'] * n + ['kclo'] * n

c = {'height': height_3x, 'p/t/k Closure': ptk_closure, 'Closure type' : closure_type}
closure = pd.DataFrame(data=c)
#closure.head()

sns.lmplot(x='p/t/k Closure', y='height', data=closure, fit_reg=True, hue="Closure type")
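
The cell above builds a "long" table by hand, with one row per speaker per closure type. As an alternative sketch (not what the cell above does), pandas' `melt` can produce the same shape in one step. The table below is made up for illustration.

```python
import pandas as pd

# Hypothetical "wide" table: one row per speaker, one column per measurement.
toy = pd.DataFrame({
    'height': [160.0, 175.0],
    'pclo': [90.0, 100.0],
    'tclo': [80.0, 95.0],
    'kclo': [70.0, 85.0],
})

# melt keeps `height` on every row and stacks the three closure columns.
long_table = toy.melt(
    id_vars='height',
    value_vars=['pclo', 'tclo', 'kclo'],
    var_name='Closure type',
    value_name='p/t/k Closure',
)
print(long_table)
```

Each speaker now appears three times, once per closure type, which is the layout that the `hue` argument needs in order to color points by type.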

(Optional) If you're interested, here's a scatter plot of height against VOT for each place of articulation.

In [None]:
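# Stack the three VOT columns into one long column, repeating height for each
height_3x = list(data['height']) * 3
ptk_vot = pd.concat([data.pvot, data.tvot, data.kvot], ignore_index=True)
n = len(data)  # number of speakers, instead of a hard-coded count
vot_type = ['pvot'] * n + ['tvot'] * n + ['kvot'] * n

v = {'height': height_3x, 'p/t/k VOT': ptk_vot, 'VOT type' : vot_type}
vot = pd.DataFrame(data=v)

sns.lmplot(x='p/t/k VOT', y='height', data=vot, fit_reg=False, hue="VOT type")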

In [None]:
sns.lmplot(x='p/t/k VOT', y='height', data=vot, fit_reg=True, hue="VOT type")