# Analyzing UNH Software Carpentry workshop interest, registration, and attendance

In this notebook we analyze some of the data collected from the [Software Carpentry workshop at the UNH School of Marine Science and Ocean Engineering](http://bsmith89.github.io/2015-08-27-unh/). From the workshop, we have anonymized data from initial interest emails, registration, and sign-in sheets. Our ultimate goal is to plot the per-department and per-job-title counts for each time period in a stacked bar chart. Along the way we will clean up and manipulate the data, create plots, and then adjust the plots until they look right.

In [None]:
# Import some stuff we will need
# First, import from __future__ so our code will run on both Python 2 and 3
from __future__ import division, print_function
import os
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Try to import Seaborn to make plots pretty
try:
    import seaborn
    seaborn.set(style="white", context="notebook", font_scale=1.5)
except ImportError:
    print("Cannot import Seaborn. Try:\n\n    conda install seaborn\n")

# See what data files we have available
os.listdir("data/anonymized/")

In [None]:
# Let's look at one raw dataset loaded as a pandas DataFrame
df = pd.read_csv("data/anonymized/signin_day1_department.csv")

# The last item executed in a cell will become the output. For a DataFrame, this
# will be a nicely rendered table
df

In [None]:
# Let's make a bar chart with the counts from each department
counts = df.department.value_counts()
counts.plot(kind="barh")
plt.show()
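
For readers unfamiliar with `value_counts`, the following minimal sketch shows what it returns. The department entries here are made up for illustration and are not taken from the workshop data.

```python
import pandas as pd

# Made-up department entries, not from the workshop data
example = pd.Series(["Ocean Engineering", "Earth Sciences", "Ocean Engineering"])

# value_counts returns a Series indexed by the unique values, holding the
# number of occurrences of each, sorted from most to least common
example_counts = example.value_counts()
print(example_counts)
```

Because the result is itself a Series, it can be plotted directly, which is what `counts.plot(kind="barh")` does above.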

In [None]:
# Departments are sometimes entered under several different names, so we write
# a function that maps each variant to a single name

def correct_department(dept_name):
    """
    Correct department names so there aren't apparent duplicates.
    """
    # Create a dictionary for department aliases that we can look up
    aliases = {"OE": "Ocean Engineering",
               "ME": "Mechanical Engineering",
               "Earth Science": "Earth Sciences",
               "EOS": "Earth Sciences",
               "OPAL": "Earth Sciences"}
    # Add some rules for fixing department names
    first_two_words = " ".join(dept_name.split()[:2])
    if dept_name in aliases:
        return aliases[dept_name]
    elif first_two_words in aliases:  # Matching first two words
        return aliases[first_two_words]
    elif "oceanog" in dept_name.lower():
        return "Earth Sciences"
    elif "civil" in dept_name.lower():
        return "Civil Engineering"
    elif dept_name.isupper():
        return dept_name
    else:
        # Return the name formatted with title case
        return dept_name.title()

# Let's see how it works on the previously loaded DataFrame `df`
# Note we're using a "list comprehension" to create a list in one line
df.department = [correct_department(d) for d in df.department]

# Take a look at the value counts for the corrected DataFrame
df.department.value_counts()
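
The list comprehension above calls `correct_department` on every entry, so it would raise an error on a blank row, because pandas reads blank cells as `NaN` rather than as strings. Whether the workshop data contain blanks is not known here. One possible guard, sketched below with a hypothetical stand-in function `tidy_name` and made-up data, is `Series.map` with `na_action="ignore"`, which leaves missing values untouched.

```python
import pandas as pd

def tidy_name(name):
    # Hypothetical stand-in for correct_department, kept short for illustration
    return name.strip().title()

# Made-up entries; None marks a missing value
departments = pd.Series(["ocean engineering", None, " earth sciences "])

# na_action="ignore" skips missing values instead of passing them to tidy_name
cleaned = departments.map(tidy_name, na_action="ignore")
print(cleaned)
```

Missing entries stay missing, and `value_counts` leaves them out of the tally by default.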

In [None]:
# Let's wrap all of that into a function for loading data and correcting department names
def load_data(time="interested", quantity="department"):
    """
    Load CSV data from a specified time in 
    `["interested", "registered", "signin_day1", "signin_day2"]`
    then correct names and return a Series with value counts for the 
    specified quantity.
    """
    # Create a file name using Python's new style string formatting
    fname = "data/anonymized/{}_{}.csv".format(time, quantity)
    # Load CSV data using Pandas
    df = pd.read_csv(fname)
    # Correct department name
    if quantity == "department":
        df.department = [correct_department(d) for d in df.department]
    return df[quantity].value_counts()
            
# Test it out
load_data("interested", "department")
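
As a quick check of the file-name template used in `load_data`, the sketch below builds the department file name for each of the four time periods listed in the docstring. It only constructs strings and does not touch the file system.

```python
# The four time periods named in the load_data docstring
times = ["interested", "registered", "signin_day1", "signin_day2"]

# Each pair of braces is replaced, in order, by the arguments to format
fnames = ["data/anonymized/{}_{}.csv".format(t, "department") for t in times]
for fname in fnames:
    print(fname)
```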

In [None]:
# We now need a function to fix the job title column
def correct_title(title):
    """
    Return properly formatted job title.
    """
    # Make sure title is a string
    title = str(title)
    if "grad student" in title.lower():
        return "Grad student"
    # We will group all professors together
    if "professor" in title.lower():
        return "Professor"
    else:
        return "Research staff"
    
# Let's write a test for that function
def test_correct_title():
    assert correct_title("Scientific Data Analyst") == "Research staff"
    assert correct_title("Research scientist") == "Research staff"
    assert correct_title("Grad student") == "Grad student"
    assert correct_title("Bob Johnson Professor of Awesome Stuff") == "Professor"
    print("Passed")

# Run the test
test_correct_title()
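
One consequence of `title = str(title)` is worth noting: if a title is missing, pandas supplies `NaN`, `str` turns it into the text `"nan"`, and the function files it under "Research staff". The sketch below repeats the logic of `correct_title` so that it runs on its own. Whether the workshop data actually contain missing titles is not known here.

```python
def correct_title(title):
    # Same logic as the function defined above, repeated so this runs alone
    title = str(title)
    if "grad student" in title.lower():
        return "Grad student"
    if "professor" in title.lower():
        return "Professor"
    return "Research staff"

missing = float("nan")           # how pandas represents a blank cell
as_text = str(missing)           # the string "nan"
result = correct_title(missing)  # falls through to the default category
print(as_text, "->", result)
```

If missing titles should not be counted as research staff, an extra branch or an "Unknown" category would be needed; that is a design decision for the analyst.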

In [None]:
# Update the `load_data` function to also correct job titles

def load_data(time="interested", quantity="department"):
    """
    Load CSV data from a specified time in 
    `["interested", "registered", "signin_day1", "signin_day2"]`
    then correct names and return a Series with value counts for the 
    specified quantity.
    """
    # Create a file name using Python's new style string formatting
    fname = "data/anonymized/{}_{}.csv".format(time, quantity)
    # Load CSV data using Pandas
    df = pd.read_csv(fname)
    # Correct department name
    if quantity == "department":
        df.department = [correct_department(d) for d in df.department]
    # Correct job title
    elif quantity == "title":
        df.title = [correct_title(t) for t in df.title]
    return df[quantity].value_counts()
            
# Test it out
load_data("interested", "title")

In [None]:
# Now we need a function to load data from all times
def load_all_times(quantity="department"):
    """
    Loads data for a specified quantity over all times.
    """
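    # Load the counts for each time period
    times = [("interested", "Interested"),
             ("registered", "Registered"),
             ("signin_day1", "Signed-in day 1"),
             ("signin_day2", "Signed-in day 2")]
    counts = [load_data(time, quantity) for time, _ in times]
    # Combine the Series as columns. Assigning columns one at a time to an
    # empty DataFrame would drop categories that are absent from the first period
    df = pd.concat(counts, axis=1, keys=[timename for _, timename in times])
    df.index.name = quantity.title()
    # Replace NaNs with zeros since these are counts
    df = df.fillna(0)
    return df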

load_all_times(quantity="title")
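
The `fillna(0)` step matters because the time periods need not share the same set of categories. The sketch below, using made-up counts and `pd.concat` as one way to line Series up side by side, shows the `NaN` that appears where a category is absent from a period and how it becomes a zero count.

```python
import pandas as pd

# Made-up counts; each period is missing one category
interested = pd.Series({"Earth Sciences": 5, "Ocean Engineering": 3})
registered = pd.Series({"Earth Sciences": 4, "Civil Engineering": 1})

# Place the Series side by side; rows are the union of the categories,
# with NaN where a category does not occur in a period
side_by_side = pd.concat([interested, registered], axis=1,
                         keys=["Interested", "Registered"])

# Counts are whole numbers, so fill the gaps with zero and convert to int
combined = side_by_side.fillna(0).astype(int)
print(combined)
```

The conversion to `int` is optional; it only keeps the counts from being displayed as floats.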

In [None]:
# Let's try making a stacked bar chart to see the departmental representation

# Load the department data for all times
df = load_all_times("department")

# Create a figure and set of axes using matplotlib
fig, ax = plt.subplots()

# Plot the DataFrame as a stacked bar chart to the axes we just created
df.plot(ax=ax, kind="bar", stacked=True)

# Add a legend and put this outside the plot
# From http://matplotlib.org/users/legend_guide.html#legend-location
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

In [None]:
# That last plot didn't look quite right. We wanted the x-axis to be a "pseudo-time."
# We can fix this by simply transposing the DataFrame before plotting.
# We also want the x-labels to be horizontal. Looking at the pandas docs we can see
# there is a `rot` parameter we can pass to the plot method

fig, ax = plt.subplots(figsize=(9, 5))
df.transpose().plot(ax=ax, kind="bar", stacked=True, rot=0)
ax.set_ylabel("Number of people")
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

# Call matplotlib.pyplot.show to avoid printing something like "<matplotlib.legend.Legend at 0x22a39e48>"
plt.show()
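
To see why transposing helps: pandas draws one bar per row of the DataFrame and stacks the columns within it. A small sketch with made-up counts shows how `transpose` turns the time periods into rows, and therefore into bars.

```python
import pandas as pd

# Made-up counts: rows are departments, columns are time periods
counts = pd.DataFrame({"Interested": [5, 3], "Registered": [4, 2]},
                      index=["Earth Sciences", "Ocean Engineering"])

# After transposing, rows are time periods and columns are departments
flipped = counts.transpose()
print(flipped)
```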

In [None]:
# Now let's wrap that plotting routine into a function with an option to save the figure
# to the `figures` directory with a specified format (see matplotlib docs for available formats)

# At first, the figures were not saving properly with the legend outside the axes
# See http://stackoverflow.com/questions/10101700/moving-matplotlib-legend-outside-of-the-axis-makes-it-cutoff-by-the-figure-box

def plot_all_times(quantity="department", save=False, savetype=".png"):
    """
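    Load data for all times for the specified quantity into a DataFrame, then
    create a stacked bar chart from it. If `save` is true, also save the figure
    to the `figures` directory using the file extension given by `savetype`.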
    """
    df = load_all_times(quantity)
    fig, ax = plt.subplots(figsize=(9, 5))
    df.transpose().plot(ax=ax, kind="bar", stacked=True, rot=0)
    ax.set_ylabel("Number of people")
    legend = ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
    if save:
        if not os.path.isdir("figures"):
            os.mkdir("figures")
        fname = os.path.join("figures", quantity + savetype)
        fig.savefig(fname, bbox_extra_artists=(legend,), bbox_inches="tight")
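
The save step can be tried in isolation. The following sketch uses made-up bar heights, writes to a temporary directory rather than `figures`, and selects the non-interactive `Agg` backend so that it can run outside the notebook; those choices are assumptions made for illustration. Passing the legend through `bbox_extra_artists` together with `bbox_inches="tight"` asks matplotlib to include it when computing the saved area.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Made-up data, only to have something to draw
fig, ax = plt.subplots(figsize=(4, 3))
ax.bar([0, 1], [3, 5], label="Made-up counts")

# Put the legend outside the axes, as in the plots above
legend = ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

# Save to a temporary directory, including the legend in the bounding box
outdir = tempfile.mkdtemp()
fname = os.path.join(outdir, "example.png")
fig.savefig(fname, bbox_extra_artists=(legend,), bbox_inches="tight")
plt.close(fig)

saved = os.path.isfile(fname)
print("Saved:", saved)
```

Inside the notebook itself, the `matplotlib.use("Agg")` line should be left out so that inline plotting keeps working.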

In [None]:
# Now let's plot both the departmental and job title representations and save some figures

plot_all_times("department", save=True, savetype=".png")
plot_all_times("title", save=True, savetype=".png")
plt.show()