# Applying Functions
In this lecture, we will apply what we know about user-defined functions to the analysis of sociolinguistic data, helping us with visualizations, string manipulation, and analysis.

In [None]:
from datascience import *
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## Sociolinguistic Data

The data we will look at is from a *sociolinguistic* study of the pronunciation of [str].

The data were collected by David Durian and popularized in Keith Johnson's textbook *Quantitative Methods in Linguistics*. They were gathered using the *Rapid Anonymous Survey technique*, in which individuals are asked a question designed to prompt a particular linguistic variable. In this study, the researcher asked store clerks in Columbus, Ohio for directions, prompting them to say either the word "street" or "straight". The researcher then pretended to mishear them ("Excuse me, what did you say?"), leading the clerk to say the same word more emphatically. The researcher impressionistically wrote down (out of sight) whether the speaker had said [str] or [ʃtr] ("shtr"). The researcher also noted the perceived gender and age of the speaker, as well as where the speaker was encountered, which could be used as a proxy for the speaker's socioeconomic class.

In [None]:
m = Table.read_table("wk5-str.csv")
m.show(10)

A description of each of the column labels:

- str: whether the speaker said "str" or "shtr" (that is [str] or [ʃtr])
    - strnumbers: the same as above but coded in binary (0 or 1)
- emphatic: whether the context was more emphatic (i.e., whether it was after "Excuse me, what did you say?") or less (i.e., at the first prompting).
- gender: perceived gender of the participant
- age: perceived age of the speaker as young (15-30), mid (35-50) or old (55-70)
    - ageletters: the same as above, but coded as letters
- Mall: the name of the mall the store clerk worked at
    - region: the same as above, but coded as numbers
- store: the name of the store the store clerk worked at
- class: codes economic class with three levels (Working Class, Lower Middle Class and Upper Middle Class)
    - classnumbers: the same as above but coded as numbers
- bank: codes economic class using five levels (Middle Working Class, Upper Working Class, Lower Middle Class, Mid-Middle Class, Upper Middle Class)
    - banknumbers: the same as above but coded as numbers

The variables "style" and "job" will not be used here.

Now, suppose we want to see which variant of [str] participants used in each of the conditions (more or less emphatic). We might start with statistical summaries and a visualization.

In [None]:
x = m.where('emphatic','less').column('strnumbers').sum()
print(x,"observations among those with the 'less emphatic' condition have a value of '1' in the column 'strnumbers'.")

In [None]:
x = np.count_nonzero(m.where('emphatic','less').column('str')=='shtr')
print(x,"observations among those with the 'less emphatic' condition have a value of 'shtr' in the column 'str'.")

In [None]:
ratio = x / m.where('emphatic','less').num_rows * 100
print(round(ratio,2),"% of the observations among those with the 'less emphatic' condition have a value of 'shtr' in the column 'str'.")
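
The cells above count 'shtr' responses in two ways (summing a 0/1 column and counting matches against a string) and then turn the count into a percentage. As a minimal sketch of why the two counts agree, the toy arrays below use made-up values rather than rows from `wk5-str.csv`, and they assume that 'strnumbers' codes 'shtr' as 1 and 'str' as 0.

```python
import numpy as np

# Toy data (hypothetical values, not taken from wk5-str.csv).
str_variant = np.array(['str', 'shtr', 'str', 'shtr', 'str'])
# Assumed coding: 1 for 'shtr', 0 for 'str'.
strnumbers = np.array([0, 1, 0, 1, 0])

# Comparing an array to a string gives an array of Booleans;
# np.count_nonzero counts the True values.
count_from_labels = np.count_nonzero(str_variant == 'shtr')

# Summing a 0/1 column counts the 1s.
count_from_codes = strnumbers.sum()

# Divide by the number of observations and scale to a percentage.
percent_shtr = count_from_labels / len(str_variant) * 100
print(count_from_labels, count_from_codes, round(percent_shtr, 2))
```

If the two counts ever disagree on the real data, that would be a sign that the coding in 'strnumbers' is not what we assumed.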

Let's create a barplot to count up instances of each variant. We'll use the `sns.countplot()` function from `seaborn` (not to be confused with their `barplot`, which does not count instances) to do this.

In [None]:
sns.countplot(x=m.column('str'))
plt.title("Instances of [str] variants")
plt.xlabel("[str] variant")
_=plt.ylabel("instances (count)")

It looks like there are more occurrences of "str" than "shtr". But this plot doesn't show the condition (more or less emphatic), which is crucial to our analysis. If we put 'emphatic' on the x-axis and pass 'str' to the `hue` argument, the data are split into two groups of bars by condition, with the [str] variants distinguished by color.

In [None]:
sns.countplot(x=m.column('emphatic'),
              hue=m.column('str'))
plt.title("Instances of [str] variants by emphatic condition")
plt.xlabel("emphatic condition")
_=plt.ylabel("instances (count)")
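
Each bar in a count plot with `hue` stands for the number of observations that share one combination of x-value and hue-value. The sketch below reproduces that counting by hand on toy lists (made-up values, not the real data), which can be a handy way to double-check what a plot is showing.

```python
from collections import Counter

# Toy data (hypothetical values, not taken from wk5-str.csv).
emphatic = ['less', 'less', 'more', 'more', 'less', 'more']
variant = ['shtr', 'str', 'str', 'str', 'shtr', 'shtr']

# Count each (condition, variant) pair; each pair corresponds to one bar.
pair_counts = Counter(zip(emphatic, variant))

for (condition, v), n in sorted(pair_counts.items()):
    print(condition, v, n)
```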

But there are many variables here that might play a role in the [str] variant, including gender and class. Can we split this plot into several plots based on speaker gender? There are many ways to do this, but one straightforward way, which you've seen before, is to use `.where()` to subset the data first and then create two side-by-side plots.

In [None]:
plt.figure(figsize=(10,5))
plt.suptitle("Instances of [str] variants by gender and emphatic condition")
plt.subplot(1,2,1) # 1 row, 2 columns, plot #1
aplot = sns.countplot(x=m.where('gender','w').column('emphatic'),
              hue=m.where('gender','w').column('str'))
aplot.set(title="female speakers",xlabel="emphatic condition",ylabel="count")

plt.subplot(1,2,2) # 1 row, 2 columns, plot #2
bplot = sns.countplot(x=m.where('gender','m').column('emphatic'),
              hue=m.where('gender','m').column('str'))
_=bplot.set(title="male speakers",xlabel="emphatic condition",ylabel="count")

We can also compare the values from the different stores at which people were surveyed.

In [None]:
m.group(['participant number','store']).group('store').show(35)

In [None]:
plt.figure(figsize=(10,5))
plt.suptitle("Instances of [str] variants by store and emphatic condition")
plt.subplot(1,2,1) # 1 row, 2 columns, plot #1
aplot = sns.countplot(x=m.where('store','walmart').column('emphatic'),
              hue=m.where('store','walmart').column('str'))
aplot.set(title="Walmart workers",xlabel="emphatic condition",ylabel="count")

plt.subplot(1,2,2) # 1 row, 2 columns, plot #2
bplot = sns.countplot(x=m.where('store','kauffmans').column('emphatic'),
              hue=m.where('store','kauffmans').column('str'))
_=bplot.set(title="Kauffmans workers",xlabel="emphatic condition",ylabel="count")

## Introducing a plotting function
There are so many stores to compare! I don't want to retype 'Walmart' three times in a new cell every time I want to analyze and visualize a store. Here's where we're going to simplify things by putting the code above into a custom function, so that "walmart" only needs to be passed once, as an argument.

In [None]:
def plot_store(x):
    '''Plot the str/shtr counts of speakers at store x, compared to Kauffmans.'''
    plt.figure(figsize=(10,5))
    plt.suptitle("Instances of [str] variants by store and emphatic condition")
    plt.subplot(1,2,1) # 1 row, 2 columns, plot #1
    aplot = sns.countplot(x=m.where('store',x).column('emphatic'),   # replaced 'walmart' with 'x'
              hue=m.where('store',x).column('str'))                  # replaced 'walmart' with 'x'
    aplot.set(title="{} workers".format(x),                          # replaced 'walmart' with 'x' using .format()
              xlabel="emphatic condition",
              ylabel="count") 
    
    plt.subplot(1,2,2) # 1 row, 2 columns, plot #2
    bplot = sns.countplot(x=m.where('store','kauffmans').column('emphatic'),
              hue=m.where('store','kauffmans').column('str'))
    _=bplot.set(title="Kauffmans workers",xlabel="emphatic condition",ylabel="count")
    print("Comparison of",x,"to Kauffman's.")

In [None]:
plot_store("walmart")

Now we can use any string that appears as a value in the 'store' column as the argument to `plot_store()`.

In [None]:
plot_store("ccfoodct")

Be careful with your plots! The ccfoodct plot above has the reverse of the colors of the kauffmans plot. The `catplot()` function and the `FacetGrid` class are possible ways to fix this, but they are beyond the scope of this course. Just be aware of the output of your plots, and read them carefully to make sure that they align logically with your understanding of the data.
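
One likely explanation for the swapped colors (an assumption, since we have not inspected the row order for each store) is that the hue categories are colored in the order in which they first appear in the data passed to the plot. The sketch below uses toy lists with made-up values to show how two subsets can end up with opposite orders, and how a fixed order can be built. The helper `first_appearance_order` is a hypothetical name used only for this illustration.

```python
def first_appearance_order(values):
    '''Return the distinct values in the order they first appear.'''
    return list(dict.fromkeys(values))

# Toy data (hypothetical values, not taken from wk5-str.csv).
store_a = ['str', 'shtr', 'str', 'shtr']
store_b = ['shtr', 'str', 'str', 'shtr']

# The two subsets disagree about which variant comes first.
print(first_appearance_order(store_a))
print(first_appearance_order(store_b))

# A fixed order that does not depend on the row order of either subset.
fixed_order = sorted(set(store_a) | set(store_b))
print(fixed_order)
```

`sns.countplot()` accepts a `hue_order` argument, so passing the same list (such as `fixed_order`) to both calls should keep the colors consistent across the two subplots.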

We're going to adjust `plot_store()` a little now and have it take two arguments, one for each store.

In [None]:
def plot_stores(x,y):
    '''Plot the str/shtr counts of speakers at stores x and y.'''
    plt.figure(figsize=(10,5))
    plt.suptitle("Instances of [str] variants by store and emphatic condition")
    plt.subplot(1,2,1) # 1 row, 2 columns, plot #1
    aplot = sns.countplot(x=m.where('store',x).column('emphatic'),   # replaced 'walmart' with 'x'
              hue=m.where('store',x).column('str'))                  # replaced 'walmart' with 'x'
    aplot.set(title="{} workers".format(x),                          # replaced 'walmart' with 'x' using .format()
              xlabel="emphatic condition",
              ylabel="count") 
    
    plt.subplot(1,2,2) # 1 row, 2 columns, plot #2
    bplot = sns.countplot(x=m.where('store',y).column('emphatic'),
              hue=m.where('store',y).column('str'))
    _=bplot.set(title="{} workers".format(y),xlabel="emphatic condition",ylabel="count")
    print("Comparison of",x,"to",y)

In [None]:
plot_stores('lazarus','sears')

## Functions that summarize data
We can do other things besides plot figures with our functions, of course.

In [None]:
subset=m.where('Mall','Easton')
subset.group('participant number').num_rows

In [None]:
def summarize_mall(mall):
    '''Calculate the number of speakers surveyed at a mall.'''
    subset = m.where('Mall',mall)
    num = subset.group('participant number').num_rows
    print("There were",num,"participants at the",mall,"mall.")
    return num

In [None]:
summarize_mall('Easton')

Note! After a bit of sleuthing, I've discovered a small error in our data! Participant 96 appears to have teleported from CityCenter to Polaris in the middle of data collection. This is why it is important to spot check your data for inconsistencies before you run any analyses. However, errors are also just a part of science. Normally, we would just remove the data from participant 96 at the very beginning. (Or, if we knew the study design a bit better, we could assume that the extra 'Polaris' was supposed to be 'CityCenter'. Not our study, though, so we can't know for sure.) In the rest of the lecture and in this week's homework, however, I will not require a "fixing" of this particular error.

In [None]:
m.where('participant number',96)
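
A spot check like this one can also be automated. The sketch below is one possible approach, run on toy rows with made-up values rather than on the Table `m`: it collects the set of malls recorded for each participant and flags anyone with more than one.

```python
# Toy rows of (participant number, mall); hypothetical values for illustration.
rows = [(1, 'Easton'), (1, 'Easton'),
        (2, 'Polaris'), (2, 'Polaris'),
        (3, 'CityCenter'), (3, 'Polaris')]

# Collect the set of malls recorded for each participant.
malls_by_participant = {}
for participant, mall in rows:
    malls_by_participant.setdefault(participant, set()).add(mall)

# A participant recorded at more than one mall is worth a closer look.
suspicious = [p for p, malls in malls_by_participant.items() if len(malls) > 1]
print(suspicious)
```

With the real data, a similar check could probably be done by grouping on both 'participant number' and 'Mall' and then grouping again on 'participant number', as we did with stores earlier.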

In [None]:
def summarize_mall(mall):
    '''Calculate the number of speakers surveyed at a mall and their str/shtr ratios by gender.'''
    subset = m.where('Mall',mall)
    num = subset.group('participant number').num_rows
    male = subset.where('gender','m')
    female = subset.where('gender','w')
    nm = male.group('participant number').num_rows
    nf = female.group('participant number').num_rows
    xm = np.count_nonzero(male.where('emphatic','less').column('str')=='shtr') / male.where('emphatic','less').num_rows * 100
    xf = np.count_nonzero(female.where('emphatic','less').column('str')=='shtr') / female.where('emphatic','less').num_rows * 100
    summary = "At {} Mall, {}% of male speakers (n={}) and {}% of female speakers (n={}) used 'shtr' instead of 'str' when surveyed in the 'less emphatic' condition."
    return summary.format(mall,round(xm,2),nm,round(xf,2),nf)

In [None]:
summarize_mall("Easton")

In [None]:
summarize_mall("Polaris")
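
One thing to watch for in a summary function like this: if no rows match (for example, because the mall name is misspelled), the denominator is zero and the division fails. The helper below is a hypothetical sketch, not part of the lecture's function, showing one way to guard a percentage calculation.

```python
def percent(part, whole):
    '''Return part/whole as a percentage, or None when whole is 0.'''
    if whole == 0:
        return None
    return part / whole * 100

print(percent(3, 12))   # a normal case
print(percent(0, 0))    # an empty subset
```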

## Applying functions to Table columns

We already know how to use functions inside of `for` loops. When you want to apply a function to every value in a column of a Table, the `.apply()` method is often a faster way to do the same thing. We've used `.apply()` before, specifically for string manipulation. Here's another example!

In [None]:
def caps(name):
    return name.capitalize()

In [None]:
m.apply(caps,'store')

And again, but this time treating 'udf' (United Dairy Farmers) specially using a conditional statement:

In [None]:
def caps(name):
    if name == 'udf':
        return name.upper()
    else:
        return name.capitalize()

In [None]:
m.apply(caps,'store')

In [None]:
m = m.with_column('store_caps',m.apply(caps,'store'))
m
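
Conceptually, applying a function to a column means calling the function once per value and collecting the results in an array. The sketch below spells that out with a plain NumPy array standing in for the 'store' column. It is an illustration of the idea, not a description of how `.apply()` is implemented.

```python
import numpy as np

def caps(name):
    '''Capitalize a store name, writing 'udf' in all caps.'''
    if name == 'udf':
        return name.upper()
    else:
        return name.capitalize()

# A toy column of store names that appear in this lecture.
stores = np.array(['walmart', 'udf', 'sears'])

# Call the function on each value in turn and collect the results.
stores_caps = np.array([caps(s) for s in stores])
print(stores_caps)
```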

## Booleans for comparing observations
This last section may come in handy if you want to compare two observations (rows) to one another.

In [None]:
p = m.where('participant number',5)
p

Here is our Boolean expression, which compares the value of 'str' in one subset to the value of 'str' in another subset. Note that the output is an array, since we've compared arrays (with one item each).

In [None]:
p.where('emphatic','less').column('str') == p.where('emphatic','more').column('str')

A simpler way to do the same thing is to use bracket indexing, which works as long as the subset `p` has exactly two rows. This version returns a single Boolean value rather than an array.

In [None]:
p.column('str')[0] == p.column('str')[1]

In [None]:
np.unique(m.column('participant number'))

In [None]:
count = make_array()
for person in np.unique(m.column('participant number')):
    p = m.where('participant number',person)
    if p.column('str')[0] == p.column('str')[1]:
        t = "same"
        count = np.append(count, t)
    else:
        t = "diff"
        count = np.append(count, t)
print("The array of labels ('same'/'diff') representing whether str (less) == str (more):")
print(count)
print("Total number of 'diff':",np.count_nonzero(count=="diff"))

Further simplifying...

In [None]:
count = make_array()
for person in np.unique(m.column('participant number')):
    p = m.where('participant number',person)
    b = p.column('str')[0] == p.column('str')[1]
    count = np.append(count,b)
print("The array of Booleans (True/False or 1/0) representing whether str (less) == str (more):")
print(count)
print("Total number of False:",np.count_nonzero(count==0))
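
The loop can sometimes be avoided altogether by comparing two whole arrays at once. The sketch below uses toy arrays with made-up values and assumes that every participant has exactly one 'less' row and one 'more' row, listed in the same participant order in both arrays. That assumption would need to be checked on the real data before relying on this shortcut.

```python
import numpy as np

# Toy data: one entry per participant, in the same participant order.
# Hypothetical values, not taken from wk5-str.csv.
less = np.array(['str', 'shtr', 'str', 'shtr'])
more = np.array(['str', 'str', 'str', 'shtr'])

# Elementwise comparison returns one Boolean per participant.
same = less == more
print(same)

# ~same flips True and False, so this counts the participants who changed.
print("Total number of False:", np.count_nonzero(~same))
```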

## Grouping and summing
This is just a reminder that the `.group()` method can also take additional arguments, such as `sum`, which change what `.group()` does after it groups. Rather than just counting up the number of observations, `sum` will sum up the values in the ungrouped columns. In the example below, 'strnumbers' consists of integers, so they can be added together (summed up).

In [None]:
type(m.column('strnumbers')[0])

Thus, after dropping unnecessary columns and grouping by the columns in the list, `sum` will sum up the values in 'strnumbers'. (It also sums up the values in 'emphatic', but since those are strings, the output is blank.)

In [None]:
n = m.drop("str","style","ageletters","Mall","banknumbers","bank","job","classnumbers","class","store_caps").\
    group(['gender','region','age','store','participant number'],sum).sort("participant number")
n
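
To see what grouping with `sum` does to a numeric column, here is the same idea written out by hand on toy rows (made-up values, not the real data): the values of 'strnumbers' are added up within each participant's group.

```python
# Toy rows of (participant number, strnumbers); hypothetical values.
rows = [(1, 0), (1, 1), (2, 1), (2, 1), (3, 0), (3, 0)]

# Add up the strnumbers values within each participant's group.
sums = {}
for participant, value in rows:
    sums[participant] = sums.get(participant, 0) + value
print(sums)
```

With two observations per participant, each sum can only be 0, 1, or 2.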

This could be useful in comparing participants' rates of 'str' versus 'shtr', but only because the values in 'strnumbers' are integers coded in binary (0 or 1).

In [None]:
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
aplot = sns.barplot(x=n.where('gender','w').column('age'),
                    y=n.where('gender','w').column('strnumbers sum'))
aplot.set(title="female speakers",xlabel="age group",ylabel="str/shtr value")

plt.subplot(1,2,2, sharey=aplot) # sharey to make the y-axis the same as 'aplot'
bplot = sns.barplot(x=n.where('gender','m').column('age'),
                    y=n.where('gender','m').column('strnumbers sum'))
_=bplot.set(title="male speakers",xlabel="age group",ylabel="str/shtr value")

print("str/shtr value indicates here how many times the speaker used 'shtr' (between 0 and 2 total possible.)\n\
Thus, on average, young female speakers were most likely to use 'shtr' (average 1 time), while old female \n\
speakers were the least likely.")
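
By default, `sns.barplot()` draws the mean of the y-values within each x-category, which is why the bars can be read as averages. The sketch below computes such group means by hand on toy arrays. The values are made up for illustration and are not results from the study.

```python
import numpy as np

# Toy per-participant sums (0, 1, or 2) with an age label; hypothetical values.
age = np.array(['young', 'young', 'mid', 'mid', 'old', 'old'])
shtr_sum = np.array([2, 0, 1, 0, 0, 0])

# Average the sums within each age group.
means = {}
for group in ['young', 'mid', 'old']:
    means[group] = shtr_sum[age == group].mean()
    print(group, means[group])
```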