# HateSonar Analysis

This notebook ingests the preprocessed email text from `interim/text` under the local data path (the data is downloaded by `download_datasets.ipynb`) and quantifies the level of hate speech or offensive language in each email.

Finally, the analyses are merged and saved as a single CSV file that is pushed to remote storage.

HateSonar estimates how strongly a text matches each of three categories (hate speech, offensive language, or neither), assigns a confidence to each, and reports the category with the highest confidence as the top result.
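As a rough illustration of that output, the sketch below flattens one result into a top class and a per-category score lookup. The `top_class` and `classes[i]["confidence"]` fields match how results are indexed later in this notebook; the `class_name` key, the `flatten_result` helper, and the confidence values are assumptions made for illustration, not real model output.

```python
# Hypothetical result in the shape Sonar().ping(text=...) is assumed to return.
# The confidence values are made-up placeholders, not real model output.
example_result = {
    "text": "an example sentence",
    "top_class": "neither",
    "classes": [
        {"class_name": "hate_speech", "confidence": 0.1},
        {"class_name": "offensive_language", "confidence": 0.3},
        {"class_name": "neither", "confidence": 0.6},
    ],
}


def flatten_result(result):
    """Return (top_class, {class_name: confidence}) for one result dict."""
    scores = {c["class_name"]: c["confidence"] for c in result["classes"]}
    return result["top_class"], scores


top, scores = flatten_result(example_result)
# The top class should be the category with the largest confidence.
print(top, max(scores, key=scores.get))
```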

In [7]:
!pip install hatesonar
!pip install scikit-learn==0.20.3



In [8]:
import pandas as pd
import os
import re
import datetime
from pathlib import Path
from dotenv import load_dotenv
from hatesonar import Sonar

load_dotenv("../../.env")

import sys

sys.path.append("../..")
from src import utils
import warnings

warnings.filterwarnings("ignore")

In [5]:
BASE_PATH = os.getenv("LOCAL_DATA_PATH", "../../data/")

In [6]:
df = utils.load_dataset(f"{BASE_PATH}/interim/text/")

In [7]:
df.head()

Unnamed: 0,Message-ID,Date,Body
0,<23f4b2992626d689b84a704a575d974cc794709e.came...,"Fri, 31 Jul 2020 18:41:49 -0600","['On Fri, 2020-07-31 at 19:26 +0100, Richard W..."
1,<CAB_b4sBOn9Bisre7D3pUrDmH9+3unoP5VaeRGi031ks3...,"Sat, 01 Aug 2020 11:07:52 +0800",['Jerry James <loganjerry(a)gmail.com> =E4=BA=...
2,<CAJP_izdx=xTviDd4piWMLvxua7Ti8wD81kwqFEB7ucbG...,"Sat, 01 Aug 2020 03:25:48 -0400","['libcroco was retired on Rawhide, but the lib..."
3,<rg3f65$16fd$1@ciao.gmane.io>,"Sat, 01 Aug 2020 12:12:21 +0200","['Hi,\n\nseeing the amount of fallout from LTO..."
4,<rg3fi2$ipa$1@ciao.gmane.io>,"Sat, 01 Aug 2020 12:18:41 +0200",['Neal Gompa wrote:\n> I think it does have va...


## Text Preprocessing

Emails are written casually, and our dataset contains known artifacts such as quoted replies, reply-attribution lines, PGP headers, and links, so we clean the text before running the analysis.

In [8]:
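def strip_thread(text):
    # Drop quoted replies, mail headers, PGP markers, and reply-attribution
    # lines (for example "On Fri, ... wrote:") so only the new message remains.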
    text = text.replace("\r", "")
    lines = text.split("\n")
    lines = [line for line in lines if len(line) > 0]
    lines = [line for line in lines if line[0] != ">"]
    lines = [line for line in lines if line[:3] != "Re:"]
    lines = [line for line in lines if line[:7] != "Subject"]
    lines = [line for line in lines if line[:5] != "From:"]
    lines = [line for line in lines if line[:5] != "Date:"]
    lines = [line for line in lines if "BEGIN PGP SIGNED MESSAGE" not in line]
    lines = [line for line in lines if line[:5] != "Hash:"]
    lines = [line for line in lines if line[:10] != "Version: G"]
    lines = [line for line in lines if "wrote:" not in line]
    lines = [line for line in lines if "wrote :" not in line]
    lines = [line for line in lines if "writes:" not in line]
    lines = [line for line in lines if line[:7] != "Am Mit,"]
    lines = [line for line in lines if line[:7] != "Am Don,"]
    lines = [line for line in lines if line[:7] != "Am Mon,"]
    lines = [line for line in lines if line[:7] != "Quoting"]
    lines = [line for line in lines if line[:10] != "Em Quinta,"]
    lines = [line for line in lines if "said:" not in line]
    lines = [
        line
        for line in lines
        if re.match(
            ".*n (Sun|Mon|Tue|Wed|Thu|Fri|Sat), .. (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec) 20..*",
            line,
        )
        is None
    ]
    lines = [
        line
        for line in lines
        if re.match(
            (
                ".*n (Sunday|Monday|Tuesday|Wednesday|Thursday|Friday|Saturday) .."
                " (January|February|March|April|May|June|July|August|September|October|November|December) 20..*"
            ),
            line,
        )
        is None
    ]
    lines = [
        line
        for line in lines
        if re.match(
            ".*n (Sun|Mon|Tue|Wed|Thu|Fri|Sat), (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec) .., 20..*",
            line,
        )
        is None
    ]
    lines = [
        line
        for line in lines
        if re.match(
            r".*n (Sun|Mon|Tue|Wed|Thu|Fri|Sat), 20[\d]{2}-[\d]{2}-[\d]{2} at.*",
            line,
        )
        is None
    ]
    lines = [line for line in lines if line[-6:] != "said: "]
    lines = [line for line in lines if line[-8:] != "babbled:"]
    lines = [line for line in lines if line[-7:] != "wrot=e:"]
    lines = [line for line in lines if line[-8:] != "A9crit :"]
    lines = [line for line in lines if line[0] != "|"]
    return "\n".join(lines)


# format for CSV, clean special characters, and remove extraneous emails
def pandas_clean(emails):
    emails["Body"].replace(
        to_replace=[
            r"\n",
            "\n",
        ],
        value=" ",
        regex=True,
        inplace=True,
    )
    emails["Body"].replace(
        to_replace=[r"\'", "'", ">", "<", "= ", "-", r"http\S+"],
        value="",
        regex=True,
        inplace=True,
    )
    emails["Body"].replace(
        to_replace=[r"\\\s+", r"\\s+", "="], value="", regex=True, inplace=True
    )
    emails["Body"].replace(
        to_replace=["   ", "  "], value=" ", regex=True, inplace=True
    )
    emails["Body"].replace(
        to_replace=["_", "3D"], value="", regex=True, inplace=True
    )
    emails["Body"].replace(
        to_replace=["   ", "  "], value=" ", regex=True, inplace=True
    )
    emails["Body"].replace(
        to_replace=["   ", "  "], value=" ", regex=True, inplace=True
    )
    emails["Body"] = emails["Body"].apply(
        lambda x: x.strip().replace(r"\n", "")
    )

    emails.drop(emails.index[emails["Body"] == ""], inplace=True)
    emails.drop(emails.index[emails["Body"] == " "], inplace=True)
    emails.dropna(subset=["Body"], inplace=True)

    emails = emails.reset_index()
    emails.drop("index", axis=1, inplace=True)
    return emails
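The line-filtering idea behind `strip_thread` can be sketched more compactly with `str.startswith` and a tuple of prefixes. This is a minimal sketch that covers only a subset of the markers above; the names `QUOTE_PREFIXES`, `ATTRIBUTION_MARKERS`, and `strip_quoted_lines`, as well as the sample message, are hypothetical and not part of the notebook.

```python
# Illustrative subset of the markers handled by strip_thread, not the full list.
QUOTE_PREFIXES = (">", "|", "Re:", "Subject", "From:", "Date:")
ATTRIBUTION_MARKERS = ("wrote:", "writes:", "said:")


def strip_quoted_lines(text):
    """Drop empty lines, quoted lines, and reply-attribution lines."""
    kept = []
    for line in text.replace("\r", "").split("\n"):
        if not line:
            continue  # skip empty lines
        if line.startswith(QUOTE_PREFIXES):
            continue  # skip quoted text and repeated mail headers
        if any(marker in line for marker in ATTRIBUTION_MARKERS):
            continue  # skip lines such as "... wrote:"
        kept.append(line)
    return "\n".join(kept)


# Hypothetical message used only to exercise the function.
sample = "Hi,\r\n\r\nOn Monday Alice wrote:\n> old text\nThanks for the fix."
print(strip_quoted_lines(sample))
```

Keeping the markers in tuples makes it easier to add a new prefix without adding another list comprehension.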

In [9]:
clean = df.copy()
clean["Body"] = clean["Body"].apply(strip_thread)
clean = pandas_clean(clean)
clean

Unnamed: 0,Message-ID,Date,Body
0,<CAB_b4sBOn9Bisre7D3pUrDmH9+3unoP5VaeRGi031ks3...,"Sat, 01 Aug 2020 11:07:52 +0800",[Jerry James loganjerry(a)gmail.com E4BA8E 202...
1,<CAJP_izdx=xTviDd4piWMLvxua7Ti8wD81kwqFEB7ucbG...,"Sat, 01 Aug 2020 03:25:48 -0400","[libcroco was retired on Rawhide, but the libc..."
2,<rg3f65$16fd$1@ciao.gmane.io>,"Sat, 01 Aug 2020 12:12:21 +0200","[Hi,seeing the amount of fallout from LTO, I r..."
3,<20200801121236.4381.17318@mailman01.iad2.fedo...,"Sat, 01 Aug 2020 12:12:36 +0000","[Well, that second mass rebuild made things wo..."
4,<D15334F0-3457-42A9-8E18-601002F1302D@barrys-e...,"Sat, 01 Aug 2020 13:24:13 +0100","[""I see that this ticket is still NEW.Ive upda..."
...,...,...,...
14697,<20190329164043.GA10522@branched-composer.phx2...,"Fri, 29 Mar 2019 16:40:43 +0000",[OLD: Fedora3020190326.n.0NEW: Fedora302019032...
14698,<20190329173043.DA4F76079248@bastion01.phx2.fe...,"Fri, 29 Mar 2019 17:30:43 +0000",[Missing expected images:Atomichost rawxz x866...
14699,<654338f6-25fe-37fd-9101-c095e9200545@doubledo...,"Fri, 29 Mar 2019 14:47:35 -0400","[""I know its not unusual to carry builds over ..."
14700,<cd084ec7-bda8-57c0-c1f2-ea7f2c48f335@redhat.com>,"Fri, 29 Mar 2019 19:58:33 +0100","[""Dne 29. 03. 19 v 19:47 John Florian napsal(a..."


In [10]:
clean["Date"] = clean["Date"].apply(lambda x: pd.to_datetime(x))
clean["Chunk"] = clean["Date"].apply(
    lambda x: datetime.date(x.year, x.month, 1)
)
clean = clean.sort_values(by="Date")
clean.reset_index(inplace=True, drop=True)
clean.head()

Unnamed: 0,Message-ID,Date,Body,Chunk
0,<20180101220004.0632660A400B@fedocal02.phx2.fe...,2018-01-01 22:00:04+00:00,"[Dear all,You are kindly invited to the meetin...",2018-01-01
1,<20180101220004.0E97560A400C@fedocal02.phx2.fe...,2018-01-01 22:00:04+00:00,"[Dear all,You are kindly invited to the meetin...",2018-01-01
2,<20180101221314.GA52721@rawhide-composer.phx2....,2018-01-01 22:13:15+00:00,[OLD: FedoraRawhide20171231.n.0NEW: FedoraRawh...,2018-01-01
3,<20180101233509.D734E60478E3@bastion01.phx2.fe...,2018-01-01 23:35:09+00:00,[Missing expected images:Server dvd i386Workst...,2018-01-01
4,<66075732-52f6-2eb8-de1b-d89ec18244db@redhat.com>,2018-01-02 10:26:51+01:00,"[""Could you please drop the dependency on GCC ...",2018-01-01


In [11]:
clean.tail()

Unnamed: 0,Message-ID,Date,Body,Chunk
14697,<20210227161758.B43EC304C540@bastion01.iad2.fe...,2021-02-27 16:17:58+00:00,[No missing expected images.Compose FAILS prop...,2021-02-01
14698,<20210227183412.4CCC7307262F@bastion01.iad2.fe...,2021-02-27 18:34:12+00:00,[No missing expected images.Failed openQA test...,2021-02-01
14699,<346ef226-3317-c310-d80c-283e4cc7dc2d@redhat.com>,2021-02-27 20:30:45+01:00,"[Hi Benjamin, Ray,I noticed this problem while...",2021-02-01
14700,<8dee2ff2-e118-bdb2-5d77-20ca82759727@gmail.com>,2021-02-27 20:59:59+01:00,"[Hi,I am trying to test some Renoir s2idle pat...",2021-02-01
14701,<4199adc3-49c8-4d3d-d768-84327df177fa@gmail.com>,2021-02-27 18:56:52-05:00,[The assimp license field for version 5.0.1 ha...,2021-02-01


## HateSonar analysis on whole dataset


In [9]:
sonar = Sonar()

In [14]:
def speech(n):
    # Reuse the module-level `sonar` instance so the model is loaded only once.
    t = sonar.ping(text=n)
    top = t["top_class"]
    # Class order assumed here: hate speech, offensive language, neither.
    hate = t["classes"][0]["confidence"]
    off = t["classes"][1]["confidence"]
    neither = t["classes"][2]["confidence"]
    return [top, hate, off, neither]


def get_val(val, loc):
    # Return the element at position `loc` of one `speech` result.
    return val[loc]

In [15]:
clean["sonar"] = clean["Body"].apply(speech)
# Split the [top, hate, offensive, neither] list into one column per value.
clean["Top"] = clean["sonar"].apply(get_val, loc=0)
clean["Hate Speech"] = clean["sonar"].apply(get_val, loc=1)
clean["Offensive Language"] = clean["sonar"].apply(get_val, loc=2)
clean["Neither"] = clean["sonar"].apply(get_val, loc=3)

In [16]:
clean.head()

Unnamed: 0,Message-ID,Date,Body,Chunk,sonar,Top,Hate Speech,Offensive Language,Neither
0,<20180101220004.0632660A400B@fedocal02.phx2.fe...,2018-01-01 22:00:04+00:00,"[Dear all,You are kindly invited to the meetin...",2018-01-01,"[neither, 0.07996127979422231, 0.3331293663946...",neither,0.079961,0.333129,0.586909
1,<20180101220004.0E97560A400C@fedocal02.phx2.fe...,2018-01-01 22:00:04+00:00,"[Dear all,You are kindly invited to the meetin...",2018-01-01,"[neither, 0.08164312982342418, 0.3330956077948...",neither,0.081643,0.333096,0.585261
2,<20180101221314.GA52721@rawhide-composer.phx2....,2018-01-01 22:13:15+00:00,[OLD: FedoraRawhide20171231.n.0NEW: FedoraRawh...,2018-01-01,"[neither, 0.03325657099633886, 0.3733650971099...",neither,0.033257,0.373365,0.593378
3,<20180101233509.D734E60478E3@bastion01.phx2.fe...,2018-01-01 23:35:09+00:00,[Missing expected images:Server dvd i386Workst...,2018-01-01,"[neither, 0.039981707371010575, 0.326850382054...",neither,0.039982,0.32685,0.633168
4,<66075732-52f6-2eb8-de1b-d89ec18244db@redhat.com>,2018-01-02 10:26:51+01:00,"[""Could you please drop the dependency on GCC ...",2018-01-01,"[neither, 0.04388574198143961, 0.4128345886699...",neither,0.043886,0.412835,0.54328


### Offensive Language classification

From a high-level analysis, several of the flagged messages either contained a lot of excess text (most likely from links) or used more direct language when explaining issues.

In [17]:
offensive_df = clean.loc[clean["Top"] == "offensive_language"]

offensive_df.head()

Unnamed: 0,Message-ID,Date,Body,Chunk,sonar,Top,Hate Speech,Offensive Language,Neither
316,<6f26913b-d3cd-7ef2-000e-9f5931db179b@redhat.com>,2018-01-25 09:26:52+01:00,"[Just to illustrate what this is about, these ...",2018-01-01,"[offensive_language, 0.055735411386999514, 0.4...",offensive_language,0.055735,0.474144,0.470121
421,<ufabmh7arll.fsf@epithumnia.math.uh.edu>,2018-02-02 17:39:50-06:00,"[""Actually comprehending your message, I see i...",2018-02-01,"[offensive_language, 0.05987457809697904, 0.47...",offensive_language,0.059875,0.476603,0.463522
517,<CALC7GWx5vt10tK9m4PajtnEZN6kqNDE+4m==MTJq_8Dr...,2018-02-13 02:00:46+01:00,"[I don\t think, removing the changelog entirel...",2018-02-01,"[offensive_language, 0.0594192768366574, 0.484...",offensive_language,0.059419,0.484444,0.456137
611,<20180218173857.12956.59900@mailman01.phx2.fed...,2018-02-18 17:38:57+00:00,"[ If you fixed package(s), Just to make sure: ...",2018-02-01,"[offensive_language, 0.03715599088459851, 0.49...",offensive_language,0.037156,0.491301,0.471543
616,<CABB28CxRa5NdyPp76wA88FRQm1rc8=A5TQgonhu1f+oQ...,2018-02-18 20:50:06+00:00,"[""On 18 February 2018 at 18:06, Stephen John S...",2018-02-01,"[offensive_language, 0.056053959942993656, 0.4...",offensive_language,0.056054,0.478286,0.465661


In [18]:
len(offensive_df)

42
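In the five flagged rows displayed above, the `Offensive Language` confidence is only slightly higher than the `Neither` confidence. One way to quantify how narrow these calls are is to compute the margin between the two scores. The sketch below does this in plain Python with the values copied from those five rows; the `rows` and `margins` names are illustrative, and the result says nothing about the remaining flagged messages.

```python
# (row index, Offensive Language confidence, Neither confidence),
# copied from the five rows of offensive_df.head() displayed above.
rows = [
    (316, 0.474144, 0.470121),
    (421, 0.476603, 0.463522),
    (517, 0.484444, 0.456137),
    (611, 0.491301, 0.471543),
    (616, 0.478286, 0.465661),
]

# Margin by which "offensive_language" beat "neither" for each row.
margins = {idx: round(off - nei, 6) for idx, off, nei in rows}

for idx, margin in margins.items():
    print(idx, margin)
```

A small margin suggests a borderline classification that is worth reading by hand before drawing conclusions.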

### Hate Speech classification

HateSonar did not identify any of the messages in this set as hate speech. This should be investigated further to determine whether the set genuinely contains no emails that would be classified as hate speech or whether the model is missing them.
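One possible starting point for that follow-up is to rank messages by their hate-speech confidence and read the highest-scoring ones by hand. The sketch below shows the idea in plain Python, using the `Hate Speech` values from the `clean.head()` output earlier in this notebook; the `highest_hate_scores` helper is hypothetical, and five rows are far too few to support any conclusion.

```python
# (row index, Hate Speech confidence) copied from the clean.head() output.
hate_scores = [
    (0, 0.079961),
    (1, 0.081643),
    (2, 0.033257),
    (3, 0.039982),
    (4, 0.043886),
]


def highest_hate_scores(scores, n=3):
    """Return the n (index, confidence) pairs with the largest confidence."""
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]


for idx, confidence in highest_hate_scores(hate_scores):
    print(idx, confidence)
```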

## Upload results to S3

In [19]:
new_files = ((clean, f"{BASE_PATH}/processed/hatesonar.csv"),)

In [20]:
Path(f"{BASE_PATH}/processed").mkdir(parents=True, exist_ok=True)

In [21]:
clean.to_csv(new_files[0][1], header=False)

In [22]:
if os.getenv("RUN_IN_AUTOMATION"):
    utils.upload_files(
        (f, f"processed/{Path(f).stem}/hatesonar.csv") for _, f in new_files
    )