# Linguistics 110: Closure and Voice-Onset Time

### Professor Susan Lin

This notebook will familiarize you with some of the basic strategies for data analysis that can be useful not only in this course, but possibly for the rest of your time at Cal. We will cover an overview of our computing environment, and then will explore the data on closure and VOT that you submit.

## Table of Contents

1 - [Computing Environment](#computing environment)

2 - [Exploring the data](#exploring data)

## Our Computing Environment, Jupyter notebooks  <a id='computing environment'></a>
This webpage is called a Jupyter notebook. A notebook is a place to write programs and view their results. 

### Text cells
In a notebook, each rectangle containing text or code is called a *cell*.

Text cells (like this one) can be edited by double-clicking on them. They're written in a simple format called [Markdown](http://daringfireball.net/projects/markdown/syntax) to add formatting and section headings.  You don't need to learn Markdown, but you might want to.

After you edit a text cell, click the "run cell" button at the top that looks like ▶| to confirm any changes. (Try not to delete the instructions of the lab.)

**Understanding Check 1** This paragraph is in its own text cell.  Try editing it so that this sentence is the last sentence in the paragraph, and then click the "run cell" ▶| button.  This sentence, for example, should be deleted.  So should this one.

### Code cells
Other cells contain code in the Python 3 language. Running a code cell will execute all of the code it contains.

To run the code in a code cell, first click on that cell to activate it.  It'll be highlighted with a little green or blue rectangle.  Next, either press ▶| or hold down the `shift` key and press `return` or `enter`.

Try running this cell:

In [None]:
print("Hello, World!")

The fundamental building block of Python code is an expression. Cells can contain multiple lines with multiple expressions. When you run a cell, the lines of code are executed in the order in which they appear. Every `print` expression prints a line. Run the next cell and notice the order of the output.

In [None]:
print("First this line is printed,")
print("and then this one.")

### Writing Jupyter notebooks
You can use Jupyter notebooks for your own projects or documents.  When you make your own notebook, you'll need to create your own cells for text and code.

To add a cell, click the + button in the menu bar.  It'll start out as a text cell.  You can change it to a code cell by clicking inside it so it's highlighted, clicking the drop-down box next to the restart (⟳) button in the menu bar, and choosing "Code".

### Errors
Python is a language, and like natural human languages, it has rules.  It differs from natural language in two important ways:
1. The rules are *simple*.  You can learn most of them in a few weeks and gain reasonable proficiency with the language in a semester.
2. The rules are *rigid*.  If you're proficient in a natural language, you can understand a non-proficient speaker, glossing over small mistakes.  A computer running Python code is not smart enough to do that.

Whenever you write code, you'll make mistakes.  When you run a code cell that has errors, Python will sometimes produce error messages to tell you what you did wrong.

Errors are okay; even experienced programmers make many errors.  When you make an error, you just have to find the source of the problem, fix it, and move on.

We have made an error in the next cell.  Run it and see what happens.

In [None]:
print("This line is missing something."

The last line of the error output attempts to tell you what went wrong.  The *syntax* of a language is its structure, and this `SyntaxError` tells you that you have created an illegal structure.  "`EOF`" means "end of file," so the message is saying Python expected you to write something more (in this case, a right parenthesis) before finishing the cell.

There's a lot of terminology in programming languages, but you don't need to know it all in order to program effectively. If you see a cryptic message like this, you can often get by without deciphering it. Add a parenthesis to the end of the line to get rid of the error message.

Run the cell below so that we can get started on our module! These are our import statements (and a few other things). 

Because of the size of the Python community, if there is a function that you want to use, there is a good chance that someone has written one already and been kind enough to share their work in the form of packages. We can start using those packages by writing `import` and then the package name.
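As a small illustration of the different ways to write an `import`, here is a sketch that uses two packages that ship with Python (`math` and `statistics`). The numbers are made up for the example and are not from our dataset.

```python
# Import a whole package, then reach its functions with a dot.
import math
root = math.sqrt(16)
print(root)

# Import a package under a shorter nickname (an alias).
import statistics as st
durations = [0.08, 0.10, 0.12]  # made-up values, for illustration only
average = st.mean(durations)
print(average)

# Import just one function from a package.
from statistics import median
middle = median(durations)
print(middle)
```

The import cell in this notebook uses the same three patterns, for example `import numpy as np` and `from scipy.stats import mode`.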


In [None]:
# imports -- just run this cell
import scipy
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import mode
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings('ignore')
%matplotlib inline

## 1. Exploring the data <a id='exploring data'></a>
We will start by familiarizing ourselves with the data.

To visualize the data, we first need to load the file. In the cell below, we set `file_name` to the name of our dataset, which is a compilation of the results from the homework you completed last week.

Note that we have `data/` in front of the file name, which means that our file `example_data.csv` is in the `data` directory (folder).

In [None]:
file_name = 'data/example_data.csv'
data = pd.read_csv(file_name)
data.head()

### Adding features from our data

We are going to add several columns to our dataframe, one for each of the following:
+ The semester of this class (called `class`)
+ Average of all closure/vot for each individual (called `clo`/`vot`)
+ Average voiced closure/vot for each individual (called `vclo`/`vvot`)
+ Average voiceless closure/vot for each individual (called `vlclo`/`vlvot`)

First we are just going to add the column `class`. We will set it to be equal to `Fall 2017` for each row.

In [None]:
data['class'] = 'Fall 2017'
data.head()

Next we will add the column for the average of all of the closures for each row. First we will pull out just the columns that we want to take the average of.

In [None]:
subset = data[['pclo', 'tclo', 'kclo', 'bclo', 'dclo', 'gclo']]
subset.head()

Then we will take the average across those rows.

In [None]:
clo_avg = subset.mean(axis=1)
clo_avg
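If the `axis=1` argument looks mysterious, the sketch below may help. It builds a tiny made-up table (two hypothetical speakers, not real class data) and takes the average both ways.

```python
import pandas as pd

# Made-up closure durations (in seconds) for two hypothetical speakers.
toy = pd.DataFrame({
    'pclo': [0.10, 0.20],
    'tclo': [0.20, 0.30],
    'kclo': [0.30, 0.40],
})

row_means = toy.mean(axis=1)     # one average per row (per speaker)
column_means = toy.mean(axis=0)  # one average per column (per consonant)

print(row_means)
print(column_means)
```

With `axis=1` we get one number per person, which is what we want for a per-person average closure.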

And finally, we will append those values to our dataframe as a column called `clo`.

In [None]:
data['clo'] = clo_avg
data.head()

We then repeat this process for all of the other columns that we want to create.

In [None]:
data['vot'] = data[['pvot', 'tvot', 'kvot', 'bvot', 'dvot', 'gvot']].mean(axis=1)
data['vclo'] = data[['bclo', 'dclo', 'gclo']].mean(axis=1)   # voiced stops: b, d, g
data['vvot'] = data[['bvot', 'dvot', 'gvot']].mean(axis=1)
data['vlclo'] = data[['pclo', 'tclo', 'kclo']].mean(axis=1)  # voiceless stops: p, t, k
data['vlvot'] = data[['pvot', 'tvot', 'kvot']].mean(axis=1)
data.head()

### 1.1 Data Visualization
Now that we have our data in order, let's get a picture of the data with some plots.

#### COUNT: 
Below, we will see the number of people who speak each language.

In [None]:
sns.countplot(y="language", data=data)

Below, we have the distribution of height.

In [None]:
sns.distplot(data['height'])

### 1.2 Descriptive Statistics
Below, we compute some basic properties of the column `clo`.

In [None]:
closure_mode = data.clo.mode().iloc[0]  # most frequent value (the smallest one if there is a tie)
print('Mode: ', closure_mode)

data.clo.describe()
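To make the vocabulary concrete, here is a minimal sketch of the same kinds of statistics computed by hand on five made-up numbers (not values from our class), using Python's built-in `statistics` package.

```python
import statistics

# Made-up closure durations in seconds, for illustration only.
values = [0.05, 0.07, 0.07, 0.09, 0.12]

toy_mean = statistics.mean(values)      # sum divided by count
toy_median = statistics.median(values)  # middle value once sorted
toy_mode = statistics.mode(values)      # most frequent value
toy_range = [min(values), max(values)]  # smallest and largest value

print(toy_mean, toy_median, toy_mode, toy_range)
```

Keep in mind that when measurements are recorded to many decimal places, each value may occur only once, so the mode can be less informative than the mean or median.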

# 2. Explore relationships to metadata
Now let's explore relationships between closure and different characteristics of the people who provided the measurements, specifically language, gender, and height. We'll draw scatter plots to see whether there are linear relationships between them.

### 2.1 Language
Here, each dot is a person and you can see what language they speak and their respective closure measurement.

We can see that the majority speak English.
Question: If we try to draw conclusions about people who speak Tagalog or Dutch, would those conclusions be reliable? Why or why not? (Type your answer below)

**Answer**:

Below, a violin plot shows the distribution of closure for the people who speak each language.

In [None]:
sns.violinplot(x="clo", y="language", data=data);

Compare the distributions, including the medians marked inside each violin. What do they tell you?

**Answer**:

### 2.2 Height

Now we'll look at how height influences closure.

In [None]:
sns.lmplot(x='clo', y='height', data=data, fit_reg=False)

In the scatter plot above, each dot is defined by closure and height. Change `fit_reg=False` to `fit_reg=True` in the code above to see the regression line.
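A regression line is a straight line chosen to pass as close as possible to the points. As a rough sketch of the idea (this is not seaborn's own code, and the points are made up), NumPy's `polyfit` can find the slope and intercept of such a line.

```python
import numpy as np

# Made-up points that lie exactly on the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Fit a degree-1 polynomial (a straight line) by least squares.
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)
```

A slope near zero suggests that one variable changes very little as the other increases; real data will rarely sit exactly on the line the way these points do.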

What does this graph tell about the relationship between height and closure? Type your answer below.

**Answer**:

### Visualizing Multiple Features

So far, we've been presenting two kinds of information in one plot (e.g. language vs. closure). Would presenting more than two at once help our analysis? Let's try it.

In [None]:
sns.lmplot(fit_reg=False, x="clo", y="height", hue="gender", palette="Set1", data=data)

Describe in your own words what each dot in the scatter plot above represents. Do you see some general patterns here? Type your answer below.

**Answer**:

Now, change `fit_reg=False` to `fit_reg=True` in the code above to see the regression lines.

Regression lines describe the general trend of the data. What conclusions can you draw by comparing the two regression lines? Type your answer below.

**Answer**:

Below, the color of each dot depends on the language that person speaks rather than their gender.

In [None]:
sns.lmplot(x='clo', y='height', data=data, fit_reg=False, hue="language")

What conclusions can you make from the graph above? Is it easier to analyze this plot than the plot before? Why? Type your answer below.

**Answer**:

# 3. Compare data of entire class

### 3.1 Compare individual CLOSURE with the rest of class

Let's see where your own data stands in relation to the rest of the class. Change `myRowNumber` to the number of the row in the table that contains your data and run the cell.

In [None]:
myRowNumber = 7  # replace 7 with the row number of your own data

In [None]:
xAxis = data['height'].tolist()
yAxis = data['clo'].tolist()

myHeight = xAxis.pop(myRowNumber)
myClosure = yAxis.pop(myRowNumber)

plt.plot(xAxis, yAxis, 'co')
plt.plot(myHeight, myClosure, "ro")

plt.show()
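The cell above relies on `pop`, which removes one item from a list and hands it back. Here is a tiny sketch with made-up heights. Remember that Python starts counting at 0, so index 2 is the third item.

```python
heights = [160, 172, 168, 181]  # made-up values, for illustration only
my_row = 2

# pop removes the item at the given position and returns it.
my_height = heights.pop(my_row)

print(my_height)  # the value that was at position 2
print(heights)    # everyone else
```

That is how your own point ends up plotted in a different color: it is taken out of the class list first and then drawn separately.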

### 3.2 Compare our data with data from last semester

It's often useful to compare current data with past data. Below, we'll explore class data collected from last semester.

In [None]:
spring2017_file = 'data/vots.csv'
sp17 = pd.read_csv(spring2017_file)
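sp17['Class'] = 'Spring 2017'

# Rename this semester's columns to match last semester's before stacking the tables.
# (This assumes `vots.csv` stores each person's average closure in a column named `closure`.)
fa17 = data.rename(columns={'class': 'Class', 'clo': 'closure'})
sp17data = pd.concat([sp17, fa17], ignore_index=True)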
sp17data.head()
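Stacking two tables on top of each other works best when they share column names. The sketch below shows the idea with two tiny made-up tables. The column names here are chosen for the example and are an assumption about how the real files are laid out.

```python
import pandas as pd

# Two made-up tables with the same columns.
last_term = pd.DataFrame({'closure': [0.08, 0.10], 'Class': 'Spring 2017'})
this_term = pd.DataFrame({'closure': [0.09], 'Class': 'Fall 2017'})

# Stack them; ignore_index=True renumbers the rows from 0.
combined = pd.concat([last_term, this_term], ignore_index=True)
print(combined)
```

The `Class` column is what lets a plot color the points by semester after the tables are combined.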

As before, we'll calculate the mean, mode, median, and range of last semester's closure and compare to ours.

In [None]:
data_closure_mean = data.clo.mean()
data_closure_mode = data.clo.mode().iloc[0]  # smallest value if there is a tie
data_closure_median = data.clo.median()
data_closure_range = [data.clo.min(), data.clo.max()]

sp17_closure_mean = sp17.closure.mean()
sp17_closure_mode = sp17.closure.mode().iloc[0]
sp17_closure_median = sp17.closure.median()
sp17_closure_range = [sp17.closure.min(), sp17.closure.max()]

df = pd.DataFrame()
df['Closure'] = ['mean', 'mode', 'median', 'range']
df['Spring 2017'] = [sp17_closure_mean, sp17_closure_mode, sp17_closure_median, sp17_closure_range]
df['Fall 2017'] = [data_closure_mean, data_closure_mode, data_closure_median, data_closure_range]
df

Let's check the closure mean by **gender** of last semester's class and compare to our class.

In [None]:
fig, ax = plt.subplots(1,2)
fig.set_size_inches(10, 8)
sns.barplot(x="gender", y="closure", data=sp17, ax=ax[0])
sns.barplot(x="gender", y="clo", data=data, ax=ax[1])

In [None]:
fig, ax = plt.subplots(1,2)
fig.set_size_inches(10, 4)
sns.stripplot(x="closure", y="gender", data=sp17[sp17['gender'] != "Other / prefer not to answer"], jitter=True, ax=ax[0]).set_ylabel('')
sns.stripplot(x="clo", y="gender", data=data[data['gender'] != "Other / prefer not to answer"], jitter=True, ax=ax[1]).set_ylabel('')
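The square brackets in the cell above keep only some rows of a table. Here is a minimal sketch of how that filtering works, using a made-up table whose column names mirror the ones in this notebook (the gender labels other than the excluded one are invented for the example).

```python
import pandas as pd

# A made-up table, for illustration only.
toy = pd.DataFrame({
    'gender': ['Female', 'Male', 'Other / prefer not to answer', 'Female'],
    'clo': [0.08, 0.10, 0.09, 0.07],
})

# Comparing a column to a value gives one True/False per row.
mask = toy['gender'] != 'Other / prefer not to answer'

# Indexing with the mask keeps only the rows marked True.
filtered = toy[mask]

print(mask.tolist())
print(len(filtered))
```

Note that the mask is built from the same table it filters, so the True/False values line up row by row.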

Now let's look at their language and closure and compare to ours.

In [None]:
fig, ax = plt.subplots(1,2)
fig.set_size_inches(10, 4)
sns.violinplot(x="clo", y="language", data=data, ax=ax[0])
sns.violinplot(x="closure", y="language", data=sp17, ax=ax[1]).set_ylabel('')

Very interesting. How about the **height**?

In [None]:
sns.lmplot(x='closure', y='height', data=sp17data, fit_reg=False, hue='Class')

Lastly, let's look at the relationship between **height, gender, and closure** and compare.

In [None]:
sns.lmplot(x="closure", y="height",hue='gender', col="Class", data=sp17data[sp17data['gender'] != 'Other / prefer not to answer'])

Overall, how does our class data compare with last semester's? Type your answer below.

**Answer**:

# 4. Place of Articulation

## 4.1 Closure

Let's see if we can find relationships between closure and place of articulation.

First, we'll look at the **histogram** of closure, pclo, tclo, and kclo.

In [None]:
data.hist('clo', bins=np.arange(0, .225, .025))
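The `bins` argument is a list of edges that divide the number line into boxes. As a small sketch (with made-up numbers and a simpler step size than the one used above), here is what `np.arange` produces and how values get counted into the boxes.

```python
import numpy as np

# Edges from 0 up to (but not including) 1.0, in steps of 0.25.
edges = np.arange(0, 1.0, 0.25)
print(edges)

# Count how many made-up values fall between each pair of neighboring edges.
values = [0.1, 0.2, 0.3, 0.6, 0.7]
counts, _ = np.histogram(values, bins=edges)
print(counts)
```

Because the stop value itself is left out, any measurement larger than the last edge would not appear in the histogram, so it is worth checking that the edges cover the whole range of your data.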

Let's first visualize the places of articulation below.

In [None]:
data.hist('pclo', bins=np.arange(0, .25, .025))

In [None]:
data.hist('tclo', bins=np.arange(0, .2, .025))

In [None]:
data.hist('kclo', bins=np.arange(0, .225, .025))

Now we have an idea of the distributions of the data we are exploring. Let's plot and compare each of the places of articulation.

**tclo and pclo:**

In [None]:
sns.lmplot(x='pclo', y='tclo', data=data)

**kclo and pclo:**

In [None]:
sns.lmplot(x='pclo', y='kclo', data=data)

**kclo and tclo:**

In [None]:
sns.lmplot(x='tclo', y='kclo', data=data)

Do you see any interesting relationships? Type your observation below.

**Answer**:

Here's one plot that has pclo, tclo, and kclo all plotted against height.

In [None]:
n = len(data)  # number of rows, so the labels match the table whatever its size
height_3x = pd.concat([data.height] * 3)
ptk_closure = pd.concat([data.pclo, data.tclo, data.kclo])
closure_type = ['pclo'] * n + ['tclo'] * n + ['kclo'] * n

c = {'height': height_3x, 'p/t/k Closure': ptk_closure, 'Closure type' : closure_type}
closure = pd.DataFrame(data=c)
#closure.head()

sns.lmplot(x='p/t/k Closure', y='height', data=closure, fit_reg=False, hue="Closure type")
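The cell above stacks three columns into one long column by hand. pandas also offers `melt`, which does this kind of reshaping in one step. The sketch below is only an alternative way to build a similar table, shown on a made-up two-person example rather than our class data.

```python
import pandas as pd

# Made-up wide table: one row per person, one column per consonant.
wide = pd.DataFrame({
    'height': [160, 175],
    'pclo': [0.08, 0.10],
    'tclo': [0.07, 0.09],
    'kclo': [0.06, 0.11],
})

# Reshape to long form: one row per (person, consonant) pair.
long = wide.melt(
    id_vars='height',
    value_vars=['pclo', 'tclo', 'kclo'],
    var_name='Closure type',
    value_name='p/t/k Closure',
)
print(long)
```

Each person now appears three times, once per consonant, which is the shape a plot needs in order to color the dots by closure type.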

## 4.2 VOT

Let's first check out the numerical distribution of VOT.

In [None]:
data.hist('vot', bins=np.arange(0, .3, .05))

Let's plot out the distributions of each articulation as well.

In [None]:
data.hist('pvot', bins=np.arange(0, .2, .025))

In [None]:
data.hist('tvot', bins=np.arange(0, .3, .025))

In [None]:
data.hist('kvot', bins=np.arange(0, .7, .025))

Now we have an idea of the distributions of the data we are exploring. Let's plot and compare each of the places of articulation.

**tvot and pvot:**

In [None]:
sns.lmplot(x='pvot', y='tvot', data=data)

**kvot and pvot:**

In [None]:
sns.lmplot(x='pvot', y='kvot', data=data)

**kvot and tvot:**

In [None]:
sns.lmplot(x='tvot', y='kvot', data=data)

Do you see any interesting relationships? Type your observation below.

**Answer**:

(Optional) If you're interested, here's a scatter plot of height against VOT for each place of articulation.

In [None]:
n = len(data)  # number of rows, instead of a hard-coded 87
ptk_vot = pd.concat([data.pvot, data.tvot, data.kvot])
vot_type = ['pvot'] * n + ['tvot'] * n + ['kvot'] * n

v = {'height': height_3x, 'p/t/k VOT': ptk_vot, 'VOT type' : vot_type}
vot = pd.DataFrame(data=v)

sns.lmplot(x='p/t/k VOT', y='height', data=vot, fit_reg=False, hue="VOT type")

# 5. Voiced stops (bvot, dvot, gvot)

In [None]:
# SIMULATING VOICED (bdg) VALUES FROM VOICELESS (ptk)
# SHOULD NOT BE PART OF THE NOTEBOOK
data['bvot'] = data.pvot + 0.05
data['dvot'] = data.tvot + 0.05
data['gvot'] = data.kvot + 0.05

We will begin by adding a column for average voiced and average voiceless for each person in our table:

In [None]:
# `vot` above averages all six stops, so compute the voiceless (ptk) average separately
data['vot (ptk)'] = data[['pvot', 'tvot', 'kvot']].mean(axis=1)
data['vot (bdg)'] = data[['bvot', 'dvot', 'gvot']].mean(axis=1)
data.head()

Now, we will compare our data. First, let's look at the relationship between the average VOT of the voiced (bdg) and the voiceless (ptk) stops:

In [None]:
sns.lmplot(x='vot (ptk)', y='vot (bdg)', data=data, fit_reg=False)

In [None]:
sns.lmplot(x='vot (ptk)', y='height', data=data, fit_reg=False)

In [None]:
sns.lmplot(x='vot (bdg)', y='height', data=data, fit_reg=False)

What interesting relationships do you observe? Type your answer below:

**Answer**: 

# 6. Overall Observation

Share three interesting relationships you observed (e.g. the relationship between closure and height). Explain the significance of each.

**First observation**:

**Second observation**:

**Third observation**: