This notebook compares the email activity and draft activity of an IETF working group.

Import the BigBang modules as needed. These should be in your Python environment if you've installed BigBang correctly.

In [1]:
import bigbang.mailman as mailman
from bigbang.parse import get_date

# from bigbang.functions import *
from bigbang.archive import Archive

from ietfdata.datatracker import *

Also, let's import a number of other dependencies we'll use later.

In [2]:
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import numpy as np
import math
import pytz
import pickle
import os

## Load the HRPC Mailing List

Now let's load the email data for analysis.

In [3]:
wg = "hrpc"

urls = [wg]

archives = [Archive(url, mbox=True) for url in urls]

activities = [arx.get_activity(resolved=False) for arx in archives]
activity = activities[0]



## Load IETF Draft Data

Next, we will use the `ietfdata` tracker to look at the frequency of drafts for this working group.

In [5]:
from ietfdata.datatracker import *
from ietfdata.datatracker_ext import *

import pandas as pd

dt = DataTracker()

g = dt.group_from_acronym("hrpc")
drafts = [
    draft
    for draft in dt.documents(group=g, doctype=dt.document_type_from_slug("draft"))
]


draft_df = pd.DataFrame.from_records(
    [{"time": draft.time, "title": draft.title, "id": draft.id} for draft in drafts]
)

We will want to use the date of each draft, since the full timestamp is finer-grained than we need.

In [7]:
draft_df["date"] = draft_df["time"].dt.date
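
To make the truncation concrete, here is a minimal sketch on made-up timestamps. The `toy_df` frame and its values are illustrative assumptions, not data from the Datatracker.

```python
import pandas as pd

# Hypothetical timestamps standing in for draft.time values.
toy_df = pd.DataFrame(
    {
        "time": pd.to_datetime(
            ["2019-03-01 09:15:00", "2019-03-01 17:40:00", "2019-03-02 08:00:00"]
        )
    }
)

# .dt.date drops the time of day, leaving plain datetime.date objects.
toy_df["date"] = toy_df["time"].dt.date

# Two of the three rows now fall on the same date.
unique_days = toy_df["date"].nunique()
```

After truncation, drafts that were timestamped at different times on the same day can be counted together.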

## Gender score and tendency measures

This notebook guesses the gender of each draft author from their first name, a notably imperfect method.

In [None]:
from gender_detector import gender_detector as gd

detector = gd.GenderDetector("us")


def gender_score(name):
    """
    Takes a full name and returns a score for the guessed
    gender.

    1 - male
    0 - female
    .5 - unknown
    """
    try:
        first_name = name.split(" ")[0]
        guess = detector.guess(first_name)
        if guess == "male":
            return 1.0
        elif guess == "female":
            return 0.0
        else:
            # The detector is not confident enough to guess
            return 0.5
    except Exception:
        # Any failure (e.g. a non-string name) is treated as "unknown"
        return 0.5
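
The scoring convention can be tried without the `gender_detector` package. The sketch below uses a hypothetical `ToyDetector` with a two-name lookup table. It is a stand-in for illustration only and does not reflect how the real detector makes its guesses.

```python
class ToyDetector:
    """Hypothetical stand-in for the real detector, for illustration only."""

    _known = {"alice": "female", "bob": "male"}

    def guess(self, first_name):
        # Anything outside the tiny lookup table is reported as "unknown".
        return self._known.get(first_name.lower(), "unknown")


def toy_gender_score(name, detector):
    """Apply the same 1.0 / 0.0 / 0.5 convention as gender_score."""
    first_name = name.split(" ")[0]
    guess = detector.guess(first_name)
    return {"male": 1.0, "female": 0.0}.get(guess, 0.5)


toy = ToyDetector()
scores = [
    toy_gender_score(name, toy)
    for name in ["Alice Example", "Bob Example", "Zq Example"]
]
```

Names the detector cannot place fall into the 0.5 bucket, which is why the "unknown" group can end up large.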

## Gender guesses on mailing list activity

Now we use the gender guesser to track contributions by participants of each guessed gender over time.

In [None]:
from bigbang.parse import clean_name

In [None]:
gender_activity = (
    activity.groupby(by=lambda x: gender_score(clean_name(x)), axis=1)
    .sum()
    .rename({0.0: "women", 0.5: "unknown", 1.0: "men"}, axis="columns")
)
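
As a rough illustration of grouping sender columns by a function, the sketch below uses a toy activity table and a hypothetical `toy_label` lookup in place of `gender_score(clean_name(x))`. It groups via a transpose, since grouping with `axis=1` is deprecated in recent pandas releases.

```python
import pandas as pd

# Hypothetical per-day message counts, one column per sender.
toy_activity = pd.DataFrame(
    {"alice": [1, 0, 2], "bob": [3, 1, 0], "carol": [0, 2, 2]},
    index=[737000, 737001, 737002],
)

# Hypothetical label lookup standing in for gender_score(clean_name(x)).
toy_label = {"alice": "women", "carol": "women", "bob": "men"}

# Transpose so senders become rows, group them by label, then transpose back.
toy_grouped = toy_activity.T.groupby(lambda sender: toy_label[sender]).sum().T
```

Each resulting column holds the summed daily counts of every sender that maps to that label.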

Note that our gender scoring method is currently unable to make a clear guess for a large percentage of the emails!

In [None]:
print(
    "%.2f percent of emails are from an unknown gender."
    % (100 * gender_activity["unknown"].sum() / gender_activity.sum().sum())
)

plt.bar(["women", "unknown", "men"], gender_activity.sum())
plt.title("Total emails sent by guessed gender")
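
To show how a share becomes a printed percentage, here is a small sketch with made-up totals. The numbers in `toy_totals` are assumptions chosen for easy arithmetic.

```python
# Hypothetical totals per guessed gender.
toy_totals = {"women": 120, "unknown": 300, "men": 580}

# Fraction of all emails whose sender's gender could not be guessed.
unknown_share = toy_totals["unknown"] / sum(toy_totals.values())

# Multiply by 100 for a percentage; "%.2f" keeps two decimal places.
message = "%.2f percent of emails are from an unknown gender." % (
    100 * unknown_share
)
```

Note the format specifier is `%.2f`. Writing `%f.2` would print the default six decimal places followed by a literal `.2`.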

## Plotting

Some preprocessing is necessary to get the drafts data ready for plotting.

In [None]:
# plt.get_cmap works across matplotlib versions; cm.get_cmap was removed in 3.9
viridis = plt.get_cmap("viridis")

In [None]:
drafts_per_day = draft_df.groupby("date").count()["title"]
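
The per-day count can be checked on a toy frame. The titles and dates below are invented for illustration.

```python
import datetime

import pandas as pd

# Hypothetical drafts: two on March 1st, one on March 2nd.
toy_drafts = pd.DataFrame(
    {
        "date": [
            datetime.date(2019, 3, 1),
            datetime.date(2019, 3, 1),
            datetime.date(2019, 3, 2),
        ],
        "title": ["Draft A", "Draft B", "Draft C"],
    }
)

# count() tallies non-null entries per group; selecting "title" yields a Series.
toy_per_day = toy_drafts.groupby("date").count()["title"]
```

The resulting Series is indexed by date, with one entry per day that has at least one draft.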

For each of the mailing lists we are looking at, plot the rolling average (over `window`) of number of emails sent per day.

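Then plot a vertical line for each day that has drafts, with height equal to that day's draft count. Coloring the lines by gender tendency is commented out in the code below, so they are drawn in red.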

In [None]:
window = 100
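
A small window makes the effect of the rolling average easier to see. This sketch uses a hypothetical five-day series and a window of 3 rather than 100.

```python
import pandas as pd

# Hypothetical daily email counts.
toy_counts = pd.Series([2, 4, 6, 8, 10])

# With a window of 3, the first two entries are NaN because fewer than
# three values are available at those positions.
toy_rolling = toy_counts.rolling(3).mean()

# dropna() removes the leading NaN values before plotting.
toy_smoothed = toy_rolling.dropna()
```

With `window = 100`, the first 99 days of each series are dropped in the same way, so the plotted lines start later than the raw data.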

In [None]:
plt.figure(figsize=(12, 6))

for i, gender in enumerate(gender_activity.columns):
    colors = [viridis(0), viridis(0.5), viridis(0.99)]

    ta = gender_activity[gender]
    rmta = ta.rolling(window).mean()
    rmtadna = rmta.dropna()
    plt.plot_date(
        np.array(rmtadna.index),
        np.array(rmtadna.values),
        color=colors[i],
        linestyle="-",
        marker=None,
        label="%s email activity - %s" % (wg, gender),
        xdate=True,
    )


vax = plt.vlines(
    drafts_per_day.index,
    0,
    drafts_per_day,
    colors="r",  # draft_gt_per_day,
    cmap="viridis",
    label=f"{wg} drafts ({drafts_per_day.sum()} total)",
)

plt.legend()
plt.title(f"{wg} working group emails and drafts")
# plt.colorbar(vax, label = "more womanly <-- Gender Tendency --> more manly")

# plt.savefig("activites-marked.png")
# plt.show()

### Is gender diversity correlated with draft output?



In [None]:
from scipy.stats import pearsonr
import pandas as pd


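def calculate_pvalues(df):
    """Return a matrix of Pearson p-values for each pair of numeric columns."""
    df = df.dropna().select_dtypes(include="number")
    pvalues = pd.DataFrame(index=df.columns, columns=df.columns, dtype=float)
    for r in df.columns:
        for c in df.columns:
            # .loc avoids chained assignment, which can silently fail to write
            pvalues.loc[r, c] = round(pearsonr(df[r], df[c])[1], 4)
    return pvalues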

In [None]:
drafts_per_ordinal_day = pd.Series(
    {x[0].toordinal(): x[1] for x in drafts_per_day.items()}
)
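
For readers unfamiliar with ordinals, the sketch below shows the conversion on a single made-up date. Presumably the ordinal keys are meant to match an integer day index on the activity data, but that is an assumption about the data rather than something shown here.

```python
import datetime

day = datetime.date(2019, 3, 1)

# toordinal() maps a date to an integer day count in the proleptic
# Gregorian calendar, where January 1 of year 1 is day 1.
ordinal = day.toordinal()

# fromordinal() is the inverse; consecutive days differ by exactly 1.
round_trip = datetime.date.fromordinal(ordinal)
next_day = datetime.date.fromordinal(ordinal + 1)
```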

In [None]:
drafts_per_ordinal_day

In [None]:
ta.rolling(window).mean()

In [None]:
garm = np.log1p(gender_activity.rolling(window).mean())
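
As a quick illustration of the transform, `np.log1p(x)` computes `log(1 + x)`, which stays defined when a rolling mean is zero. The values in `toy_means` are made up.

```python
import numpy as np

# Hypothetical rolling means, including a day with no activity.
toy_means = np.array([0.0, 1.0, 9.0])

# log1p(x) = log(1 + x): zero stays zero and large values are compressed.
toy_logged = np.log1p(toy_means)
```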

## Measuring diversity

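As a rough measure of gender diversity, we sum the mailing list activity of women and of participants whose gender could not be guessed, then divide by the activity of men. Because `garm` holds log-transformed rolling means, this is a ratio of log-scaled values rather than of raw email counts.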

In [None]:
garm["diversity"] = (garm["unknown"] + garm["women"]) / garm["men"]
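
One property of this ratio is worth seeing on toy numbers: when the "men" value is zero, the result is infinite rather than an error. The frame below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical smoothed activity values; the second row has no activity by men.
toy = pd.DataFrame(
    {"women": [1.0, 0.5], "unknown": [1.0, 0.5], "men": [4.0, 0.0]}
)

# Same ratio as above: (unknown + women) / men.
toy["diversity"] = (toy["unknown"] + toy["women"]) / toy["men"]

# pandas follows floating-point rules, so dividing by zero yields inf.
has_infinite = bool(np.isinf(toy["diversity"]).any())
```

Infinite or missing values like these may need to be handled before computing correlations.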

In [None]:
garm["drafts"] = drafts_per_ordinal_day
garm["drafts"] = garm["drafts"].fillna(0)
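
Assigning a Series as a new column aligns it on the index, which is why the missing days need filling afterwards. A minimal sketch with invented ordinal-day indices:

```python
import pandas as pd

# Hypothetical frame indexed by ordinal day.
toy_garm = pd.DataFrame({"men": [1.0, 2.0, 3.0]}, index=[737000, 737001, 737002])

# Hypothetical draft counts, present for only one of those days.
toy_drafts_by_day = pd.Series({737001: 2})

# Assignment aligns on the index; days without drafts become NaN.
toy_garm["drafts"] = toy_drafts_by_day

# Treat days without any draft as zero drafts.
toy_garm["drafts"] = toy_garm["drafts"].fillna(0)
```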

In [None]:
garm.corr(method="pearson")
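
To help read the correlation matrix, the sketch below builds a toy frame with one perfectly increasing and one perfectly decreasing relationship. The column names are illustrative.

```python
import pandas as pd

# Hypothetical columns: "up" rises with "x", "down" falls with "x".
toy = pd.DataFrame({"x": [1, 2, 3, 4], "up": [2, 4, 6, 8], "down": [8, 6, 4, 2]})

# corr() returns a symmetric matrix of pairwise Pearson coefficients.
toy_corr = toy.corr(method="pearson")
```

Values near 1 or -1 indicate a strong linear relationship, values near 0 a weak one. The coefficient alone does not say whether the relationship is statistically significant.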

In [None]:
calculate_pvalues(garm)

Some variations...

In [None]:
garm_dna = garm.dropna(subset=["drafts"])