In [None]:
# imports -- just run this cell
import scipy
import numpy as np
import scipy.stats
import pandas as pd
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

# Closure and Voice-Onset Time

In this notebook, we will explore the relationships between the data the class submitted, your own data, and the data from a past class.

## 1. Explore your own data
We will start by familiarizing ourselves with the data.

To visualize the data, we first need to load the file. Set `fall2017_file` to the location of your CSV file by replacing the ellipsis (the three dots). Then run the cell by pressing "Shift" + "Enter".

In [None]:
fall2017_file = 'data/vots.csv' #CHANGE TO ELLIPSIS IN ORIGINAL NOTEBOOK 
fa17 = pd.read_csv(fall2017_file)
fa17['Class'] = ['fall 2017'] * len(fa17)
fa17.head()

### 1.1 Data Visualization
Now that we have our information loaded, let's get a picture of the data with some plots.

#### COUNT: 
Below, we will see how many people of each gender are in the data.

In [None]:
ax = sns.countplot(x="gender", data=fa17)

Below, we will see the number of people who speak each language.

In [None]:
sns.countplot(y="language", data=fa17)

Below, we have the distribution of height.

In [None]:
sns.distplot(fa17['height'])

### 1.2 Computing Basic Properties
We will compute the mean, median, mode, and range for closure.

In [None]:
# Series.mode() returns every tied mode in sorted order; take the smallest one
fa17_closure_mode = fa17.closure.mode().iloc[0]
print('Mode: ', fa17_closure_mode)

fa17.closure.describe()
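
The `describe()` output above covers the mean, the median (the 50% row), and the minimum and maximum, but it does not report the range as a single number. As a minimal sketch, the four statistics can also be computed directly. The values below are made up for illustration and are not class data.

```python
import pandas as pd

# Hypothetical closure durations (illustrative values only, not class data)
closure = pd.Series([0.05, 0.07, 0.07, 0.09, 0.12])

closure_mean = closure.mean()
closure_median = closure.median()
closure_mode = closure.mode().iloc[0]          # smallest value if several are tied
closure_range = closure.max() - closure.min()  # range as a single number

print('Mean:  ', closure_mean)
print('Median:', closure_median)
print('Mode:  ', closure_mode)
print('Range: ', closure_range)
```

The same four lines should work on `fa17.closure` in place of the toy series.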

# 2. Explore relationships to metadata
Now let's explore the relationships between closure and different characteristics of the people who provided the measurements, specifically language, gender, and height. We'll draw plots to see whether there are any relationships between them.

### 2.1 Language
Here, each dot is a person and you can see what language they speak and their respective closure measurement.

In [None]:
sns.stripplot(x="closure", y="language", data=fa17, jitter=True);

We can see that the majority speak English.
Question: if we tried to draw conclusions about people who speak Tagalog or Dutch, would those conclusions be reliable? Why or why not? (Type your answer below.)

**Answer**:

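Below, we take the graph above and visualize the distribution of closure for the people who speak each language. The inner box of each violin marks the median and quartiles, which give a sense of each group's typical value.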

In [None]:
sns.violinplot(x="closure", y="language", data=fa17);
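
To read off per-language numbers rather than estimating them from the plot, a grouped summary can help. This is a minimal sketch on a made-up table (the rows are hypothetical, not class data); the column names `language` and `closure` match the ones used in this notebook.

```python
import pandas as pd

# Hypothetical rows (illustrative only, not class data)
toy = pd.DataFrame({
    'language': ['English', 'English', 'English', 'Dutch'],
    'closure': [0.06, 0.08, 0.10, 0.09],
})

# One row per language: how many speakers, plus the mean and median closure
summary = toy.groupby('language')['closure'].agg(['count', 'mean', 'median'])
print(summary)
```

The `count` column is worth checking first: a mean based on one or two speakers says little about the language as a whole.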

Compare the means. What do they tell you?

**Answer**:

### 2.2 Gender

In this section, we will dig into the relationship between closure and gender.

In [None]:
ax = sns.barplot(x="gender", y="closure", data=fa17)

What's the graph above showing? Can you analyze it? Type your answer below.

**Answer**:

Below, each dot is defined by gender on the y-axis and closure on the x-axis.

In [None]:
sns.stripplot(x="closure", y="gender", data=fa17, jitter=True);

How does men's average closure compare with women's? Does your conclusion from this scatter plot agree with the bar chart we presented before? Type your answer below.

**Answer**:

### 2.3 Height

Now we'll look at how closure relates to height.

In [None]:
# x and y are passed by keyword; recent seaborn releases reject positional x/y
sns.lmplot(x='closure', y='height', data=fa17, fit_reg=False)

In the scatter plot above, each dot is defined by closure and height. Change "fit_reg" to "True" in the code above to see the regression line.
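
For reference, the line that `fit_reg=True` draws is a straight-line fit of the y variable (height) on the x variable (closure). The sketch below computes such a least-squares line with `np.polyfit` on made-up values. It illustrates the idea and is not necessarily the exact routine seaborn uses internally.

```python
import numpy as np

# Hypothetical measurements (illustrative only, not class data)
closure = np.array([0.05, 0.07, 0.09, 0.11])
height = np.array([160.0, 165.0, 170.0, 175.0])

# Fit height = slope * closure + intercept by least squares
slope, intercept = np.polyfit(closure, height, deg=1)

# Height the fitted line predicts for a closure of 0.08
predicted = slope * 0.08 + intercept
print('slope:', slope, 'intercept:', intercept, 'predicted:', predicted)
```

A slope near zero would mean the line is almost flat, that is, height tells us little about closure in a straight-line sense.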

What does this graph tell you about the relationship between height and closure? Type your answer below.

**Answer**:

### Visualizing Multiple Features

So far, we've been presenting two kinds of information in one plot (e.g. language vs. closure). Would presenting more than two at once help our analysis? Let's try it.

In [None]:
sns.lmplot(fit_reg=False, x="closure", y="height", hue="gender", palette="Set1", data=fa17)

Describe in your own words what each dot in the scatter plot above represents. Do you see some general patterns here? Type your answer below.

**Answer**:

Now, change `fit_reg=False` to `fit_reg=True` in the code above to see the regression lines.

Regression lines describe the general trend of the data. What conclusions can you draw by comparing the two regression lines? Type your answer below.

**Answer**:

Below, the color of each dot depends on the language that person speaks rather than their gender.

In [None]:
sns.lmplot(x='closure', y='height', data=fa17, fit_reg=False, hue="language")

What conclusions can you make from the graph above? Is it easier to analyze this plot than the plot before? Why? Type your answer below.

**Answer**:

# 3. Compare data of entire class

### 3.1 Compare individual CLOSURE with the rest of class

Let's see where your own data stands in relation to the rest of the class. Change `myRowNumber` to the number of the row in the table that contains your data and run the cell.

In [None]:
myRowNumber = 7 #CHANGE TO ELLIPSIS IN ORIGINAL NOTEBOOK

In [None]:
xAxis = fa17['height'].tolist()
yAxis = fa17['closure'].tolist()

myHeight = xAxis.pop(myRowNumber)
myClosure = yAxis.pop(myRowNumber)

plt.plot(xAxis, yAxis, 'co')
plt.plot(myHeight, myClosure, "ro")

plt.show()

### 3.2 Compare our data with data from last semester

It's often useful to compare current data with past data. Below, we'll explore class data collected from last semester.

In [None]:
spring2017_file = 'data/vots.csv'
sp17 = pd.read_csv(spring2017_file)
sp17['Class'] = ['spring 2017'] * len(sp17)

# pd.concat stacks the two tables; DataFrame.append is no longer available in pandas 2.x
sp17fa17 = pd.concat([sp17, fa17])
sp17fa17.head()

As before, we'll calculate the mean, mode, median, and range of last semester's closure and compare to ours.

In [None]:
fa17_closure_mean = np.mean(fa17.closure)
fa17_closure_mode = fa17.closure.mode().iloc[0]  # smallest value if several are tied
fa17_closure_median = np.median(fa17.closure)
fa17_closure_range = [min(fa17.closure), max(fa17.closure)]

sp17_closure_mean = np.mean(sp17.closure)
sp17_closure_mode = sp17.closure.mode().iloc[0]  # smallest value if several are tied
sp17_closure_median = np.median(sp17.closure)
sp17_closure_range = [min(sp17.closure), max(sp17.closure)]

df = pd.DataFrame()
df['Closure'] = ['mean', 'mode', 'median', 'range']
df['Spring 2017'] = [sp17_closure_mean, sp17_closure_mode, sp17_closure_median, sp17_closure_range]
df['Fall 2017'] = [fa17_closure_mean, fa17_closure_mode, fa17_closure_median, fa17_closure_range]
df
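
Because the combined table `sp17fa17` carries a `Class` column, a similar comparison can also be produced with a single `groupby`. The sketch below uses a made-up table with the same two column names; the numbers are hypothetical, not class data.

```python
import pandas as pd

# Hypothetical combined table (illustrative only, not class data)
toy = pd.DataFrame({
    'Class': ['spring 2017'] * 3 + ['fall 2017'] * 3,
    'closure': [0.05, 0.07, 0.09, 0.06, 0.08, 0.13],
})

# One row per class, one column per statistic
by_class = toy.groupby('Class')['closure'].agg(['mean', 'median', 'min', 'max'])
print(by_class)
```

This puts each class in a row rather than a column, which tends to scale better if more semesters are added later.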

Let's check the mean closure by **gender** for last semester's class and compare it to ours.

In [None]:
fig, ax = plt.subplots(1,2)
fig.set_size_inches(10, 8)
sns.barplot(x="gender", y="closure", data=sp17, ax=ax[0])
sns.barplot(x="gender", y="closure", data=fa17, ax=ax[1])

In [None]:
fig, ax = plt.subplots(1,2)
fig.set_size_inches(10, 4)
sns.stripplot(x="closure", y="gender", data=sp17[sp17['gender'] != "Other / prefer not to answer"], jitter=True, ax=ax[0]).set_ylabel('')
sns.stripplot(x="closure", y="gender", data=fa17[fa17['gender'] != "Other / prefer not to answer"], jitter=True, ax=ax[1]).set_ylabel('')

Now let's look at their language and closure and compare to ours.

In [None]:
fig, ax = plt.subplots(1,2)
fig.set_size_inches(10, 4)
sns.violinplot(x="closure", y="language", data=fa17, ax=ax[0])
sns.violinplot(x="closure", y="language", data=sp17, ax=ax[1]).set_ylabel('')

Very interesting. How about the **height**?

In [None]:
ax = sns.lmplot(x='closure', y='height', data=sp17fa17, fit_reg=False, hue='Class')

Lastly, let's look at the relationship between **height, gender, and closure** and compare.

In [None]:
ax = sns.lmplot(x="closure", y="height", hue='gender', col="Class", data=sp17fa17[sp17fa17['gender'] != 'Other / prefer not to answer'])

Overall, how does our class data compare with last semester's? Type your answer below.

**Answer**:

# 4. Place of Articulation

## 4.1 Closure

Let's see if we can find relationships between closure and place of articulation.

First, we'll look at the **histogram** of closure, pclo, tclo, and kclo.

In [None]:
fa17.hist('closure', bins=np.arange(0, .225, .025))

Next, let's visualize closure at each place of articulation below.

In [None]:
ax = fa17.hist('pclo', bins=np.arange(0, .25, .025))

In [None]:
ax = fa17.hist('tclo', bins=np.arange(0, .2, .025))

In [None]:
ax = fa17.hist('kclo', bins=np.arange(0, .225, .025))

Now we have an idea of the distributions of the data we are exploring. Let's plot and compare each of the places of articulation.

**tclo and pclo:**

In [None]:
sns.lmplot(x='pclo', y='tclo', data=fa17)

**kclo and pclo:**

In [None]:
sns.lmplot(x='pclo', y='kclo', data=fa17)

**kclo and tclo:**

In [None]:
sns.lmplot(x='tclo', y='kclo', data=fa17)
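
A correlation coefficient puts a number on how closely two columns follow a straight-line relationship: values near 1 or -1 indicate a tight line, values near 0 indicate little linear relationship. The sketch below uses `DataFrame.corr()` (Pearson correlation by default) on made-up values; the column names match the notebook, but the numbers are hypothetical.

```python
import pandas as pd

# Hypothetical closure values (illustrative only, not class data)
toy = pd.DataFrame({
    'pclo': [0.05, 0.07, 0.09, 0.11],
    'tclo': [0.04, 0.06, 0.08, 0.10],   # rises in step with pclo
    'kclo': [0.10, 0.08, 0.09, 0.07],   # tends to fall as pclo rises
})

# Pairwise Pearson correlations between the three columns
corr = toy.corr()
print(corr)
```

Applying the same call to `fa17[['pclo', 'tclo', 'kclo']]` should give the three pairwise values for the class data.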

Do you see any interesting relationships? Type your observation below.

**Answer**:

Here's one plot that has pclo, tclo, and kclo all plotted against height.

In [None]:
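# Stack the columns with pd.concat (Series.append is no longer available in pandas 2.x)
height_3x = pd.concat([fa17.height] * 3)
ptk_closure = pd.concat([fa17.pclo, fa17.tclo, fa17.kclo])
# Use the actual number of rows instead of a hard-coded count
closure_type = ['pclo'] * len(fa17) + ['tclo'] * len(fa17) + ['kclo'] * len(fa17)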

c = {'height': height_3x, 'p/t/k Closure': ptk_closure, 'Closure type' : closure_type}
closure = pd.DataFrame(data=c)
#closure.head()

ax = sns.lmplot(x='p/t/k Closure', y='height', data=closure, fit_reg=False, hue="Closure type")
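
The cell above builds a "long" table by hand: one row per person per closure type. As an alternative sketch, `DataFrame.melt` can do the same reshaping in one call. The table below is made up for illustration; only the column names are taken from this notebook.

```python
import pandas as pd

# Hypothetical wide table: one row per person (illustrative only)
toy = pd.DataFrame({
    'height': [160, 172],
    'pclo': [0.08, 0.10],
    'tclo': [0.06, 0.07],
    'kclo': [0.05, 0.09],
})

# Reshape to long format: height is repeated, the three closure columns are stacked
long_closure = toy.melt(
    id_vars='height',
    value_vars=['pclo', 'tclo', 'kclo'],
    var_name='Closure type',
    value_name='p/t/k Closure',
)
print(long_closure)
```

The resulting columns line up with the `x`, `y`, and `hue` names used in the plot, so no row counts need to be typed in by hand.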

## 4.2 VOT

Let's first check out the numerical distribution of VOT.

In [None]:
ax = fa17.hist('vot', bins=np.arange(0, .3, .05))

Let's plot the distribution for each place of articulation as well.

In [None]:
ax = fa17.hist('pvot', bins=np.arange(0, .2, .025))

In [None]:
ax = fa17.hist('tvot', bins=np.arange(0, .3, .025))

In [None]:
ax = fa17.hist('kvot', bins=np.arange(0, .7, .025))

Now we have an idea of the distributions of the data we are exploring. Let's plot and compare each of the places of articulation.

**tvot and pvot:**

In [None]:
sns.lmplot(x='pvot', y='tvot', data=fa17)

**kvot and pvot:**

In [None]:
sns.lmplot(x='pvot', y='kvot', data=fa17)

**kvot and tvot:**

In [None]:
sns.lmplot(x='tvot', y='kvot', data=fa17)

Do you see any interesting relationships? Type your observation below.

**Answer**:

(Optional) If you're interested, here's a scatter plot of height against VOT at each place of articulation.

In [None]:
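# Stack the columns with pd.concat (Series.append is no longer available in pandas 2.x)
ptk_vot = pd.concat([fa17.pvot, fa17.tvot, fa17.kvot])
vot_type = ['pvot'] * len(fa17) + ['tvot'] * len(fa17) + ['kvot'] * len(fa17)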

v = {'height': height_3x, 'p/t/k VOT': ptk_vot, 'VOT type' : vot_type}
vot = pd.DataFrame(data=v)

ax = sns.lmplot(x='p/t/k VOT', y='height', data=vot, fit_reg=False, hue="VOT type")

# 5. Voiced stops (bvot, dvot, gvot)

In [None]:
# SIMULATING VOICED (bdg) VALUES FROM VOICELESS (ptk)
# SHOULD NOT BE PART OF THE NOTEBOOK
fa17['bvot'] = fa17.pvot + 0.05
fa17['dvot'] = fa17.tvot + 0.05
fa17['gvot'] = fa17.kvot + 0.05

We will begin by adding a column for average voiced and average voiceless for each person in our table:

In [None]:
fa17.rename(columns={'vot':'vot (ptk)'}, inplace=True)
fa17['vot (bdg)'] = fa17[['bvot', 'dvot', 'gvot']].mean(numeric_only=True, axis=1)
fa17.head()
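
The `mean(axis=1)` call above averages across columns within each row, so every person gets one value. A small sketch with hypothetical numbers (not class data) shows the effect:

```python
import pandas as pd

# Hypothetical voiced-stop VOT values for two people (illustrative only)
toy = pd.DataFrame({
    'bvot': [0.01, 0.02],
    'dvot': [0.02, 0.03],
    'gvot': [0.03, 0.07],
})

# axis=1 averages across the three columns, one result per row
toy['vot (bdg)'] = toy[['bvot', 'dvot', 'gvot']].mean(axis=1)
print(toy)
```

With `axis=0` (the default), the result would instead be one average per column across all people.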

Now, we will compare our data. First, let's look at the relationship between the average VOT of the voiced stops (bdg) and that of the voiceless stops (ptk):

In [None]:
sns.lmplot(x='vot (ptk)', y='vot (bdg)', data=fa17, fit_reg=False)

In [None]:
sns.lmplot(x='vot (ptk)', y='height', data=fa17, fit_reg=False)

In [None]:
sns.lmplot(x='vot (bdg)', y='height', data=fa17, fit_reg=False)

What interesting relationships do you observe? Type your answer below:

**Answer**: 

# 6. Overall Observation

Share three interesting relationships you observed (e.g. the relationship between closure and height). Explain the significance of each.

**First observation**:

**Second observation**:

**Third observation**: