# Contributor Analysis

This notebook ingests the preprocessed data in `../interim/metadata`, downloaded by `download_datasets.ipynb`, and quantifies the activity of individual mailing-list contributors: how often each person starts a new thread, and how often each person replies to an existing thread. Both counts are computed per calendar month.

Finally, the analyses are merged and saved as a single CSV file that is pushed to remote storage for visualization with Superset.

In [1]:
import pandas as pd
import os
import datetime
import re
from pathlib import Path
from dotenv import load_dotenv

load_dotenv("../../.env")

import sys

sys.path.append("../..")
from src import utils

In [2]:
BASE_PATH = os.getenv("LOCAL_DATA_PATH", "../../data/")

In [3]:
df = utils.load_dataset(f"{BASE_PATH}/interim/metadata/")

In [4]:
df.head()

Unnamed: 0,Message-ID,Date,From,To,Subject,In-Reply-To
0,<1519862707.18745.0@posteo.de>,"Wed, 28 Feb 2018 18:05:07 -0600",mcatanzaro at gnome.org,devel at lists.fedoraproject.org,Re: Test gating enabled in Bodhi,CAM86y2s-RnNmKDpQGQK12Y1VHGx-142AzFOgzc4oDqy7w...
1,<b88b4d3b-b8f7-2ea6-ec41-ab572b831717@dustymab...,"Wed, 28 Feb 2018 19:43:15 -0500",Dusty Mabe <dusty at dustymabe.com>,devel at lists.fedoraproject.org,Re: Unannounced soname bump (Rawhide): qpdf (l...,20180228183836.GB17774@redhat.com
2,<45117c81-43ff-656d-85c3-3cf003bd0d14@fedorapr...,"Wed, 28 Feb 2018 20:49:25 -0500",Randy Barlow <bowlofeggs at fedoraproject.org>,devel at lists.fedoraproject.org,Re: Test gating enabled in Bodhi,1519862707.18745.0@posteo.de
3,<CAA55FSN-R4oV0os0LihZQTp8aa0NkR9jQPh44subK2+9...,"Wed, 28 Feb 2018 21:11:19 -0500",Orcan Ogetbil <oget.fedora at gmail.com>,devel at lists.fedoraproject.org,Re: <DKIM> Re: <DKIM> Re: <DKIM> Re: [ACTION N...,0c2eeac1ff3aad727094f24c1c704d22@laposte.net
4,<p77nrn$t3b$1@blaine.gmane.org>,"Thu, 01 Mar 2018 03:19:34 +0100",Kevin Kofler <kevin.kofler at chello.at>,devel at lists.fedoraproject.org,Frequently broken Rawhide/Branched composes (w...,CAB-QmhQrFTSLwt0dvqgpNkbKa7zVum9KnpRVAa33o4OMA...


## Minor preprocessing

Here we lightly clean the `Subject`, `From`, and `Date` fields: the reply prefix is stripped so that messages in the same thread share a subject, email addresses are removed from sender names, and date strings are parsed so that messages can be sorted and bucketed by month.

In [5]:
# strip one leading "Re: " so replies share their thread's subject
def match_threads(subject):
    return re.sub(r"^{0}".format(re.escape("Re: ")), "", subject)


# remove the trailing " <address>" part, keeping only the display name
def remove_email(text):
    return re.sub(r" <.*", "", text)


# convert date string to datetime object
def parse_date(date):
    return pd.to_datetime(date)


# 1 if the message starts a thread (no In-Reply-To header), 0 if it is a reply
def asked_or_responded(in_reply_to):
    return int(pd.isna(in_reply_to))
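
The `match_threads` helper removes only a single leading `Re: `. The sketch below reproduces that behaviour on a made-up subject and contrasts it with `strip_reply_prefixes`, a hypothetical variant (not part of this notebook) that removes any run of leading `Re:` markers. Whether the stricter variant is wanted here is an assumption.

```python
import re


def match_threads(subject):
    # same behaviour as the notebook helper: drop one leading "Re: "
    return re.sub(r"^Re: ", "", subject)


def strip_reply_prefixes(subject):
    # hypothetical variant: drop every leading "Re:" marker, ignoring case
    return re.sub(r"^(?:re:\s*)+", "", subject, flags=re.IGNORECASE)


once = match_threads("Re: Re: Test gating enabled in Bodhi")
fully = strip_reply_prefixes("Re: RE: Test gating enabled in Bodhi")
print(once)   # one prefix is left in place
print(fully)  # all leading prefixes are gone
```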

In [6]:
# apply our transformations

df["Subject"] = df["Subject"].apply(match_threads)
df["From"] = df["From"].apply(remove_email)
df["Date"] = df["Date"].apply(parse_date)
df["Chunk"] = df["Date"].apply(lambda x: datetime.date(x.year, x.month, 1))
df["Asked"] = df["In-Reply-To"].apply(asked_or_responded)
df = df.sort_values(by="Date")
df.reset_index(inplace=True, drop=True)
df.head()

Unnamed: 0,Message-ID,Date,From,To,Subject,In-Reply-To,Chunk,Asked
0,<57d76b65c8c848f7e1b83e56ff8f094ce3855479.came...,2017-12-31 18:42:21-08:00,Adam Williamson,devel at lists.fedoraproject.org,Anything we can do to temporarily halt new bug...,20171231163359.GL17340@redhat.com,2017-12-01,0
1,<20180101172438.GA2871@flame.pingoured.fr>,2018-01-01 18:24:38+01:00,Pierre-Yves Chibon,devel at lists.fedoraproject.org,[Bug 1529276] New: findbugs-contrib-7.2.0.sb i...,20171231200124.15b636ef@noname,2018-01-01,0
2,<20180101220004.0632660A400B@fedocal02.phx2.fe...,2018-01-01 22:00:04+00:00,nils at redhat.com,devel at lists.fedoraproject.org,[Fedocal] Reminder meeting : Modularity Office...,,2018-01-01,1
3,<20180101220004.0E97560A400C@fedocal02.phx2.fe...,2018-01-01 22:00:04+00:00,nils at redhat.com,devel at lists.fedoraproject.org,[Fedocal] Reminder meeting : Modularity Office...,,2018-01-01,1
4,<20180101221314.GA52721@rawhide-composer.phx2....,2018-01-01 22:13:15+00:00,Fedora Rawhide Report,devel at lists.fedoraproject.org,Fedora rawhide compose report: 20180101.n.0 ch...,,2018-01-01,1
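
Note that `Chunk` is built from each timestamp's own UTC offset, so a message is assigned to the month it was sent in the sender's local time. The first row above (`2017-12-31 18:42:21-08:00`) lands in `2017-12-01` for that reason. The sketch below shows how the bucket would change if timestamps were normalized to UTC first; that normalization is an alternative for comparison, not something this notebook does.

```python
import datetime

import pandas as pd

ts = pd.to_datetime("2017-12-31 18:42:21-08:00")

# bucketing used in this notebook: month of the sender's local timestamp
local_chunk = datetime.date(ts.year, ts.month, 1)

# alternative (assumption): convert to UTC before bucketing
ts_utc = ts.tz_convert("UTC")
utc_chunk = datetime.date(ts_utc.year, ts_utc.month, 1)

print(local_chunk, utc_chunk)
```

Only messages sent close to a month boundary are affected by this choice.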


## Quantify contributor activity

In [7]:
# Count thread-starting messages per sender over the entire dataset
askers = (
    df[df["Asked"] == 1]
    .groupby("From")
    .count()
    .sort_values("Subject", ascending=False)
)
askers.head(15)

From,Message-ID,Date,To,Subject,In-Reply-To,Chunk,Asked
Fedora compose checker,1781,1781,1781,1781,0,1781,1781
Fedora Rawhide Report,697,697,697,697,0,697,697
Ben Cotton,397,397,397,397,0,397,397
=?utf-8?q?Miro_Hron=C4=8Dok_=3Cmhroncok_at_redhat=2Ecom=3E?=,311,311,311,311,0,311,311
Adam Williamson,267,267,267,267,0,267,267
Fedora Branched Report,252,252,252,252,0,252,252
rawhide at fedoraproject.org,182,182,182,182,0,182,182
Fabio Valentini,153,153,153,153,0,153,153
Richard Shaw,145,145,145,145,0,145,145
Jan Kurik,132,132,132,132,0,132,132
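
`groupby("From").count()` reports the number of non-null values in every column, which is why the same count repeats across the row and why `In-Reply-To` is 0 for thread starters. If only the ranking is needed, a single-column count gives the same numbers more compactly. A minimal sketch on made-up data (the `toy` frame and its sender names are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "From": ["alice", "bob", "alice", "bob", "alice"],
        "Asked": [1, 1, 1, 0, 0],
    }
)

# messages that started a thread, counted per sender, largest first
starters = toy.loc[toy["Asked"] == 1, "From"].value_counts()

# replies, counted per sender
repliers = toy.loc[toy["Asked"] == 0, "From"].value_counts()

print(starters)
print(repliers)
```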


In [8]:
# Count reply messages per sender over the entire dataset

responders = (
    df[df["Asked"] == 0]
    .groupby("From")
    .count()
    .sort_values("Subject", ascending=False)
)
responders.head(15)

From,Message-ID,Date,To,Subject,In-Reply-To,Chunk,Asked
=?utf-8?q?Miro_Hron=C4=8Dok_=3Cmhroncok_at_redhat=2Ecom=3E?=,1636,1636,1636,1636,1636,1636,1636
Neal Gompa,973,973,973,973,973,973,973
Adam Williamson,888,888,888,888,888,888,888
=?utf-8?q?Zbigniew_J=C4=99drzejewski-Szmek_=3Czbyszek_at_in=2Ewaw=2Epl=3E?=,887,887,887,887,887,887,887
Kevin Fenzi,853,853,853,853,853,853,853
Kevin Kofler,740,740,740,740,740,740,740
Chris Murphy,682,682,682,682,682,682,682
Fabio Valentini,664,664,664,664,664,664,664
=?utf-8?q?V=C3=ADt_Ondruch_=3Cvondruch_at_redhat=2Ecom=3E?=,580,580,580,580,580,580,580
Florian Weimer,574,574,574,574,574,574,574
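
Several names in these tables appear as `=?utf-8?q?...?=`. These are MIME encoded-word headers (RFC 2047) that were not decoded before `remove_email` ran, so the address is still embedded in the name. The sketch below shows one way to decode them with the standard library before stripping the address. `decode_sender` is a hypothetical helper, not part of this notebook or of `src.utils`.

```python
import re
from email.header import decode_header, make_header


def decode_sender(raw):
    # decode any RFC 2047 encoded words, then drop the trailing " <address>"
    decoded = str(make_header(decode_header(raw)))
    return re.sub(r" <.*", "", decoded)


encoded = "=?utf-8?q?Miro_Hron=C4=8Dok_=3Cmhroncok_at_redhat=2Ecom=3E?="
print(decode_sender(encoded))
print(decode_sender("Neal Gompa"))  # plain names pass through unchanged
```

Applying such a step to `From` before grouping would also merge a contributor's encoded and plain spellings, if both occur in the data.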


In [9]:
# Generate results for each contributor for each month

contributors = df["From"].unique()

db = {}
for c in contributors:
    # per-month view of the messages sent by this contributor
    by_month = df[df["From"] == c].groupby("Chunk")["Asked"]
    participated = by_month.count()  # all messages
    started = by_month.sum()  # thread starters (Asked == 1)
    responded = participated - started  # replies
    db[c] = pd.concat([started, responded], axis=1)
    db[c].columns = ["asked", "responded"]
    db[c]["name"] = c

In [10]:
contributor_data_set = pd.DataFrame(columns=["asked", "responded", "name"])
for c in db.keys():
    contributor_data_set = pd.concat([contributor_data_set, db[c]])
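
The per-contributor loop and the incremental `pd.concat` can also be expressed as one grouped aggregation over `From` and `Chunk`. The following is a sketch on made-up data, not the notebook's method. The `toy` frame is illustrative, and the rows come out sorted by name and month rather than in order of first appearance.

```python
import datetime

import pandas as pd

jan = datetime.date(2018, 1, 1)
feb = datetime.date(2018, 2, 1)
toy = pd.DataFrame(
    {
        "From": ["alice", "alice", "alice", "bob"],
        "Chunk": [jan, jan, feb, jan],
        "Asked": [1, 0, 0, 1],
    }
)

# one pass: thread starters (sum) and all messages (count) per sender and month
monthly = (
    toy.groupby(["From", "Chunk"])["Asked"]
    .agg(asked="sum", participated="count")
    .reset_index()
)
monthly["responded"] = monthly["participated"] - monthly["asked"]

# match the layout used in this notebook: name index, then date, asked, responded
monthly = monthly.rename(columns={"From": "name", "Chunk": "date"})
monthly = monthly[["name", "date", "asked", "responded"]].set_index("name")
print(monthly)
```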

In [11]:
contributor_data_set.shape

(7819, 3)

In [12]:
contributor_data_set = contributor_data_set.reset_index().set_index("name")

In [13]:
contributor_data_set.rename(columns={"index": "date"}, inplace=True)

In [14]:
contributor_data_set

name,date,asked,responded
Adam Williamson,2017-12-01,0,1
Adam Williamson,2018-01-01,12,49
Adam Williamson,2018-02-01,8,25
Adam Williamson,2018-03-01,16,61
Adam Williamson,2018-04-01,9,23
...,...,...,...
Dylan M Taylor,2020-11-01,0,4
Wim Taymans,2020-11-01,0,2
Ondrej Pohorelsky,2020-11-01,1,1
Ed Neville,2020-11-01,1,0


## Upload results to S3

In [15]:
new_files = (
    (contributor_data_set, f"{BASE_PATH}/processed/contributors.csv"),
)

In [16]:
Path(f"{BASE_PATH}/processed").mkdir(parents=True, exist_ok=True)

In [17]:
contributor_data_set.to_csv(new_files[0][1], header=False)
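
Because the file is written with `header=False`, it contains no column names, so any consumer has to supply them. A minimal round-trip sketch with an in-memory buffer; the column order `name, date, asked, responded` is taken from the frame shown above, and the single sample row is copied from it:

```python
import io

import pandas as pd

sample = pd.DataFrame(
    {"date": ["2018-01-01"], "asked": [12], "responded": [49]},
    index=pd.Index(["Adam Williamson"], name="name"),
)

buf = io.StringIO()
sample.to_csv(buf, header=False)  # index is written as the first column
buf.seek(0)

# the reader must provide the names that header=False left out
restored = pd.read_csv(
    buf, names=["name", "date", "asked", "responded"], parse_dates=["date"]
)
print(restored)
```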

In [18]:
if os.getenv("RUN_IN_AUTOMATION"):
    utils.upload_files(
        (f, f"processed/{Path(f).stem}/contributors.csv") for _, f in new_files
    )
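
For reference, the second element of each pair passed to `utils.upload_files` is derived from the local file name via `Path(f).stem`. The sketch below only evaluates that expression for the default `BASE_PATH`. That `upload_files` treats it as the remote object key is an assumption, since the helper's implementation is not shown in this notebook.

```python
from pathlib import Path

# local path as built from the default BASE_PATH ("../../data/")
local_file = "../../data//processed/contributors.csv"

# same expression as in the upload cell
remote_key = f"processed/{Path(local_file).stem}/contributors.csv"
print(remote_key)
```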