# Keyword Analysis on Fedora Mailing List using Watson NLU

This notebook demonstrates analyzing the Fedora mailing list using Watson Natural Language Understanding - using default models.

This is based on the Watson Discovery Tutorial at https://github.com/spackows/CASCON-2019_NLP-workshops/blob/master/notebooks/Notebook-1_Exploring-NLU.ipynb

In [60]:
import pandas as pd
import numpy as np
import re
import os
from src import utils
import datetime
from collections import defaultdict
from pathlib import Path

from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_watson.natural_language_understanding_v1 import (
    Features,
    KeywordsOptions,
    SemanticRolesOptions,
)

In [61]:
BASE_PATH = os.getenv("LOCAL_DATA_PATH", "../../data")

## Step 1: Look up Natural Language Understanding API key and URL

1. From the **Navigation menu** ( <img style="margin: 0px; padding: 0px; display: inline;" src="https://github.com/spackows/CASCON-2019_NLP-workshops/raw/master/images/nav-menu-icon.png"/> ), under the **Services** group, right-click "Watson Services" and then open the link in a new browser tab
2. In the new Watson services tab, from the **Action** menu beside your Natural Language Understanding instance, select "Manage in IBM Cloud"
3. In the service details page that opens, copy the apikey and URL

In [121]:
apikey = ""  # <-- PASTE YOUR APIKEY HERE
url = ""  # <-- PASTE YOUR SERVICE URL HERE

The NLU API can be used to extract:
- Sentiment
- Emotion
- Keywords
- Entities
- Categories
- Concepts
- Syntax
- Semantics

In [63]:
# Instantiate a natural language understanding object
authenticator = IAMAuthenticator(apikey)
nlu = NaturalLanguageUnderstandingV1(
    version="2018-11-16", authenticator=authenticator
)
nlu.set_service_url(url)

## Step 2: Import Fedora Email List

In [64]:
df = utils.load_dataset(f"{BASE_PATH}/interim/text/")
df.head()

Unnamed: 0,Message-ID,Date,Body
0,<23f4b2992626d689b84a704a575d974cc794709e.came...,"Fri, 31 Jul 2020 18:41:49 -0600","['On Fri, 2020-07-31 at 19:26 +0100, Richard W..."
1,<CAB_b4sBOn9Bisre7D3pUrDmH9+3unoP5VaeRGi031ks3...,"Sat, 01 Aug 2020 11:07:52 +0800",['Jerry James <loganjerry(a)gmail.com> =E4=BA=...
2,<CAJP_izdx=xTviDd4piWMLvxua7Ti8wD81kwqFEB7ucbG...,"Sat, 01 Aug 2020 03:25:48 -0400","['libcroco was retired on Rawhide, but the lib..."
3,<rg3f65$16fd$1@ciao.gmane.io>,"Sat, 01 Aug 2020 12:12:21 +0200","['Hi,\n\nseeing the amount of fallout from LTO..."
4,<rg3fi2$ipa$1@ciao.gmane.io>,"Sat, 01 Aug 2020 12:18:41 +0200",['Neal Gompa wrote:\n> I think it does have va...


In [65]:
df = df[150:200]

Only taking a small sample of the dataframe due to 2 reasons:
* On the Free IBM Watson Account, we are allowed only a limited number of queries
* Some emails have code and other unclean text which all cause the following watson_analyze() function to throw an error

#TODO: Import data cleaning scripts from auto-faq project to overcome above error

In [66]:
# Cleaning function to be separated from this nb as a part of
# https://github.com/aicoe-aiops/mailing-list-analysis-toolkit/issues/41
def strip_thread(text):
    text = text.replace("\r", "")
    lines = text.split("\n")
    lines = [line for line in lines if len(line) > 0]
    lines = [line for line in lines if line[0] != ">"]
    lines = [line for line in lines if line[:3] != "Re:"]
    lines = [line for line in lines if line[:7] != "Subject"]
    lines = [line for line in lines if line[:5] != "From:"]
    lines = [line for line in lines if line[:5] != "Date:"]
    lines = [line for line in lines if "BEGIN PGP SIGNED MESSAGE" not in line]
    lines = [line for line in lines if line[:5] != "Hash:"]
    lines = [line for line in lines if line[:10] != "Version: G"]
    lines = [line for line in lines if "wrote:" not in line]
    lines = [line for line in lines if "wrote :" not in line]
    lines = [line for line in lines if "writes:" not in line]
    lines = [line for line in lines if line[:7] != "Am Mit,"]
    lines = [line for line in lines if line[:7] != "Am Don,"]
    lines = [line for line in lines if line[:7] != "Am Mon,"]
    lines = [line for line in lines if line[:7] != "Quoting"]
    lines = [line for line in lines if line[:10] != "Em Quinta,"]
    lines = [line for line in lines if "said:" not in line]
    lines = [
        line
        for line in lines
        if re.match(
            ".*n (Sun|Mon|Tue|Wed|Thu|Fri|Sat), .. (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec) 20..*",
            line,
        )
        is None
    ]
    lines = [
        line
        for line in lines
        if re.match(
            (
                ".*n (Sunday|Monday|Tuesday|Wednesday|Thursday|Friday|Saturday) .."
                " (January|February|March|April|May|June|July|August|September|October|November|December) 20..*"
            ),
            line,
        )
        is None
    ]
    lines = [
        line
        for line in lines
        if re.match(
            ".*n (Sun|Mon|Tue|Wed|Thu|Fri|Sat), (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec) .., 20..*",
            line,
        )
        is None
    ]
    lines = [
        line
        for line in lines
        if re.match(
            r".*n (Sun|Mon|Tue|Wed|Thu|Fri|Sat), 20[\d]{2}-[\d]{2}-[\d]{2} at.*",
            line,
        )
        is None
    ]
    lines = [line for line in lines if line[-6:] != "said: "]
    lines = [line for line in lines if line[-8:] != "babbled:"]
    lines = [line for line in lines if line[-7:] != "wrot=e:"]
    lines = [line for line in lines if line[-8:] != "A9crit :"]
    lines = [line for line in lines if line[0] != "|"]
    return "\n".join(lines)


# format for CSV, clean special characters, and remove extranous emails
def pandas_clean(emails):
    emails["Body"].replace(
        to_replace=[
            r"\n",
            "\n",
        ],
        value=" ",
        regex=True,
        inplace=True,
    )
    emails["Body"].replace(
        to_replace=[r"\'", "'", ">", "<", "= ", "-", r"http\S+"],
        value="",
        regex=True,
        inplace=True,
    )
    emails["Body"].replace(
        to_replace=[r"\\\s+", r"\\s+", "="], value="", regex=True, inplace=True
    )
    emails["Body"].replace(
        to_replace=["   ", "  "], value=" ", regex=True, inplace=True
    )
    emails["Body"].replace(
        to_replace=["_", "3D"], value="", regex=True, inplace=True
    )
    emails["Body"].replace(
        to_replace=["   ", "  "], value=" ", regex=True, inplace=True
    )
    emails["Body"].replace(
        to_replace=["   ", "  "], value=" ", regex=True, inplace=True
    )
    emails["Body"] = emails["Body"].apply(
        lambda x: x.strip().replace(r"\n", "")
    )

    emails.drop(emails.index[emails["Body"] == ""], inplace=True)
    emails.drop(emails.index[emails["Body"] == " "], inplace=True)
    emails.dropna(subset=["Body"], inplace=True)

    emails = emails.reset_index()
    emails.drop("index", axis=1, inplace=True)
    return emails

In [67]:
clean = df.copy()
clean["Body"] = df["Body"].apply(strip_thread)
clean = pandas_clean(clean)
clean

Unnamed: 0,Message-ID,Date,Body
0,<20200803214201.GG15536@redhat.com>,"Mon, 03 Aug 2020 22:42:01 +0100","[""I cant reproduce this locally but it happens..."
1,<20200803215342.GH15536@redhat.com>,"Mon, 03 Aug 2020 22:53:42 +0100","["" to the latest build, but thats the ELN buil..."
2,<20200804002941.i26sbstikzielo6t@mandor.scrye....,"Mon, 03 Aug 2020 17:29:41 -0700","[""ok. I did what I could with the resources we..."
3,<CAHzpm2hyYOALhb4Or1We1EbDi+SHk_ADpo8+fMXooruL...,"Tue, 04 Aug 2020 12:02:50 +1000","[""Hi,The upstream author of wayvnc (a VNC serv..."
4,<CAN3TeO3B-_TUiYmnHgibWOB_p_Qvyo_+d+oTRkmON1nG...,"Mon, 03 Aug 2020 22:14:24 -0500","[""Sometimes you need to get into the build dir..."
5,<20200804061443.GA30379@nautica>,"Tue, 04 Aug 2020 08:14:43 +0200","[Hi,this is more of a head up than a bug per s..."
6,<CAO9z1z_Yo84sDLSONGnGA7da_cemp98YBujLxWRffyvQ...,"Tue, 04 Aug 2020 00:49:38 -0600","[At today blocker review meeting[0], we ran ac..."
7,<20200804074640.GI15536@redhat.com>,"Tue, 04 Aug 2020 08:46:40 +0100","[""I disabled debuginfo generation in ocamlppxt..."
8,<20200804075445.20516.19410@mailman01.iad2.fed...,"Tue, 04 Aug 2020 07:54:45 +0000","[""Kevin, thanks for caring about TrojitC3A1. J..."
9,<CAB_b4sBDsGQG-eA=dV=pa8fV+o6Ee3_vsB4qygncTgF2...,"Tue, 04 Aug 2020 16:39:02 +0800",[Qiyu Yan yanqiyu(a)fedoraproject.org E4BA8E20...


In [68]:
clean["Date"] = clean["Date"].apply(lambda x: pd.to_datetime(x))
clean["Chunk"] = clean["Date"].apply(
    lambda x: datetime.date(x.year, x.month, 1)
)
clean = clean.sort_values(by="Date")
clean.reset_index(inplace=True, drop=True)
clean.head()

Unnamed: 0,Message-ID,Date,Body,Chunk
0,<20200803214201.GG15536@redhat.com>,2020-08-03 22:42:01+01:00,"[""I cant reproduce this locally but it happens...",2020-08-01
1,<20200803215342.GH15536@redhat.com>,2020-08-03 22:53:42+01:00,"["" to the latest build, but thats the ELN buil...",2020-08-01
2,<20200804002941.i26sbstikzielo6t@mandor.scrye....,2020-08-03 17:29:41-07:00,"[""ok. I did what I could with the resources we...",2020-08-01
3,<CAHzpm2hyYOALhb4Or1We1EbDi+SHk_ADpo8+fMXooruL...,2020-08-04 12:02:50+10:00,"[""Hi,The upstream author of wayvnc (a VNC serv...",2020-08-01
4,<CAN3TeO3B-_TUiYmnHgibWOB_p_Qvyo_+d+oTRkmON1nG...,2020-08-03 22:14:24-05:00,"[""Sometimes you need to get into the build dir...",2020-08-01


## Step 3: Analyze sample customer messages

For our analysis, we'll focus on extracting:
- Keywords 
- Actions and Objects (from semantic roles)

In [69]:
def watson_analyze(message):

    """
    Extract keywords and sematic roles from text
    Input : message
    Output : Array of action verbs, Array of Keywords
    """
    result = nlu.analyze(
        text=message,
        features=Features(
            keywords=KeywordsOptions(), semantic_roles=SemanticRolesOptions()
        ),
    ).get_result()
    actions_arr = []
    keywords_arr = []
    for keyword in result["keywords"]:
        keywords_arr.append(keyword["text"])
    if "semantic_roles" in result:
        for semantic_result in result["semantic_roles"]:
            if "action" in semantic_result:
                actions_arr.append(semantic_result["action"]["normalized"])

    return pd.Series([actions_arr, keywords_arr])

In [70]:
clean["Actions"] = np.nan
clean["Keywords"] = np.nan

In [71]:
clean[["Actions", "Keywords"]] = clean["Body"].apply(
    lambda x: watson_analyze(x)
)

In [72]:
clean.head()

Unnamed: 0,Message-ID,Date,Body,Chunk,Actions,Keywords
0,<20200803214201.GG15536@redhat.com>,2020-08-03 22:42:01+01:00,"[""I cant reproduce this locally but it happens...",2020-08-01,"[cant reproduce, be]","[Richard Jones, Tiny program, Red Hat, Virtual..."
1,<20200803215342.GH15536@redhat.com>,2020-08-03 22:53:42+01:00,"["" to the latest build, but thats the ELN buil...",2020-08-01,"[thats, build, fail.I]","[latest build, ELN build, Richard Jones, lates..."
2,<20200804002941.i26sbstikzielo6t@mandor.scrye....,2020-08-03 17:29:41-07:00,"[""ok. I did what I could with the resources we...",2020-08-01,"[do, have, notice, use, lower, move, have noti...","[varnish package cache, kvm instances, vm gues..."
3,<CAHzpm2hyYOALhb4Or1We1EbDi+SHk_ADpo8+fMXooruL...,2020-08-04 12:02:50+10:00,"[""Hi,The upstream author of wayvnc (a VNC serv...",2020-08-01,[],"[upstream author of wayvnc, new release wayvnc..."
4,<CAN3TeO3B-_TUiYmnHgibWOB_p_Qvyo_+d+oTRkmON1nG...,2020-08-03 22:14:24-05:00,"[""Sometimes you need to get into the build dir...",2020-08-01,"[need, use, have to rely, to find, have]","[...%{vpathbuilddir, build directory, case for..."


## Step 4: Aggregate Keywords by Month


Here we aggregate the keywords extracted using watson NLU for each email by month and create a long dataframe with the columns `month`, `word`, and `count` which can be used for making plots.

Note : To observe monthly trends, we need to be able to run the analysis on a larger sample of the data which will let us get more months and keywords to analyze.

For that, we need a Paid account to be able to run a larger query as well as for us to succesfully run the analysis without an error, we have to be able to clean the dataset to remove the code fragments

In [110]:
months = set()
monthly_dict_list = []
month_keywords = defaultdict(int)

for index, row in clean.iterrows():

    for word in row["Keywords"]:
        month_keywords[word] += 1
        month_keywords[(str(row["Chunk"]), str(word))] = month_keywords.pop(
            word
        )

    if row["Chunk"] not in months:

        months.add(row["Chunk"])

        month_keywords = dict(
            sorted(month_keywords.items(), key=lambda item: item[1])
        )

        monthly_dict_list.append(month_keywords)
        month_keywords = defaultdict(int)

In [111]:
monthly_dict_list

[{('2020-08-01', 'Richard Jones'): 1,
  ('2020-08-01', 'Tiny program'): 1,
  ('2020-08-01', 'Red Hat'): 1,
  ('2020-08-01', 'Virtualization Group'): 1,
  ('2020-08-01', 'Bad exit status'): 1,
  ('2020-08-01', 'Permission deniederror'): 1,
  ('2020-08-01', 'bin'): 1,
  ('2020-08-01', 'manypowerful monitoring features'): 1,
  ('2020-08-01', 'strip'): 1,
  ('2020-08-01', 'virtual machines'): 1,
  ('2020-08-01', 'file'): 1,
  ('2020-08-01', 'Koji'): 1,
  ('2020-08-01', 'lib'): 1,
  ('2020-08-01', 'rpm'): 1,
  ('2020-08-01', 'virtualization blog'): 1,
  ('2020-08-01', 'reason'): 1,
  ('2020-08-01', 'usr'): 1,
  ('2020-08-01', 'programming'): 1,
  ('2020-08-01', 'var'): 1,
  ('2020-08-01', 'net stats'): 1,
  ('2020-08-01', 'ideas'): 1,
  ('2020-08-01', '5A3ApT'): 1,
  ('2020-08-01', 'disk stats'): 1,
  ('2020-08-01', 'builddir'): 1,
  ('2020-08-01', 'build'): 1,
  ('2020-08-01', 'BUILDROOT'): 1,
  ('2020-08-01', 'brpstrip'): 1,
  ('2020-08-01', 'fc33'): 1,
  ('2020-08-01', 'omake'): 1,
  ('2

In [112]:
monthly_words_df = pd.DataFrame(
    [
        {"month": key[0], "word": key[1], "count": value}
        for i in range(len(monthly_dict_list))
        for (key), value in monthly_dict_list[i].items()
    ]
)

In [114]:
monthly_words_df.head(10)

Unnamed: 0,month,word,count
0,2020-08-01,Richard Jones,1
1,2020-08-01,Tiny program,1
2,2020-08-01,Red Hat,1
3,2020-08-01,Virtualization Group,1
4,2020-08-01,Bad exit status,1
5,2020-08-01,Permission deniederror,1
6,2020-08-01,bin,1
7,2020-08-01,manypowerful monitoring features,1
8,2020-08-01,strip,1
9,2020-08-01,virtual machines,1


In [117]:
monthly_words_df.tail(10)

Unnamed: 0,month,word,count
20,2020-08-01,ideas,1
21,2020-08-01,5A3ApT,1
22,2020-08-01,disk stats,1
23,2020-08-01,builddir,1
24,2020-08-01,build,1
25,2020-08-01,BUILDROOT,1
26,2020-08-01,brpstrip,1
27,2020-08-01,fc33,1
28,2020-08-01,omake,1
29,2020-08-01,tmp,1


## Step 5: Save results



In [119]:
new_files = []
dataset_base_path = Path(f"{BASE_PATH}/processed/keywords/")
dataset_base_path.mkdir(parents=True, exist_ok=True)

monthly_words_df.to_csv(
    f"{BASE_PATH}/processed/keywords/watson-nlu-keywords.csv", header=False
)
new_files.append(f"{BASE_PATH}/processed/keywords/watson-nlu-keywords.csv")

In [120]:
if os.getenv("RUN_IN_AUTOMATION"):
    utils.upload_files(
        (f, f"processed/keywords/{Path(f).stem}.csv") for f in new_files
    )

Copyright © 2019 IBM. This notebook and its source code are released under the terms of the MIT License.