<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Importing-packages-&amp;-dataset" data-toc-modified-id="Importing-packages-&amp;-dataset-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Importing packages &amp; dataset</a></span></li><li><span><a href="#Exploring-the-dataset" data-toc-modified-id="Exploring-the-dataset-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Exploring the dataset</a></span><ul class="toc-item"><li><span><a href="#General-features" data-toc-modified-id="General-features-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>General features</a></span></li><li><span><a href="#Getting-to-know-the-labels" data-toc-modified-id="Getting-to-know-the-labels-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Getting to know the labels</a></span><ul class="toc-item"><li><span><a href="#Can-a-comment-be-both-toxic-and-severely-toxic?" data-toc-modified-id="Can-a-comment-be-both-toxic-and-severely-toxic?-2.2.1"><span class="toc-item-num">2.2.1&nbsp;&nbsp;</span>Can a comment be both toxic and severely toxic?</a></span></li><li><span><a href="#Do-the-obsene/threat/insult/identity_hate-tags-come-under-the-toxic-umbrella?" 
data-toc-modified-id="Do-the-obsene/threat/insult/identity_hate-tags-come-under-the-toxic-umbrella?-2.2.2"><span class="toc-item-num">2.2.2&nbsp;&nbsp;</span>Do the obsene/threat/insult/identity_hate tags come under the toxic umbrella?</a></span></li><li><span><a href="#Adding-a-non-toxic-column:" data-toc-modified-id="Adding-a-non-toxic-column:-2.2.3"><span class="toc-item-num">2.2.3&nbsp;&nbsp;</span>Adding a non-toxic column:</a></span></li><li><span><a href="#Removing-the-total_score-column" data-toc-modified-id="Removing-the-total_score-column-2.2.4"><span class="toc-item-num">2.2.4&nbsp;&nbsp;</span>Removing the total_score column</a></span></li></ul></li></ul></li><li><span><a href="#Cleaning-the-comment_text-column" data-toc-modified-id="Cleaning-the-comment_text-column-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Cleaning the comment_text column</a></span><ul class="toc-item"><li><span><a href="#IP-address" data-toc-modified-id="IP-address-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>IP address</a></span></li><li><span><a href="#Date-Time" data-toc-modified-id="Date-Time-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Date Time</a></span></li><li><span><a href="#Formatting-characters" data-toc-modified-id="Formatting-characters-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Formatting characters</a></span></li><li><span><a href="#Writing-functions-for-future-use-on-unseen-comments-data" data-toc-modified-id="Writing-functions-for-future-use-on-unseen-comments-data-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Writing functions for future use on unseen comments data</a></span></li></ul></li><li><span><a href="#Modeling-Trial-1:-Basic-spaCy-Multi-Label-Classification" data-toc-modified-id="Modeling-Trial-1:-Basic-spaCy-Multi-Label-Classification-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Modeling Trial 1: Basic spaCy Multi-Label Classification</a></span></li><li><span><a href="#Results-&amp;-predictions" 
data-toc-modified-id="Results-&amp;-predictions-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Results &amp; predictions</a></span></li></ul></div>

# Exploring spaCy: Toxicity Levels 

**Overview**

Using the spaCy package to explore the [Toxic Comment Classification Challenge](https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/overview). 

- Challenge: build a model that’s capable of detecting different types of toxicity like **threats, obscenity, insults, and identity-based hate**. 
- The dataset contains comments from Wikipedia’s talk page edits. 
- The goal is to help online discussion become more productive and respectful.

**Dataset Description**

A large number of Wikipedia comments which have been labeled by human raters for toxic behavior. 
The types of toxicity are:

- toxic
- severe_toxic
- obscene
- threat
- insult
- identity_hate

You must create a model which predicts a **probability of each type of toxicity** for each comment.

**File descriptions**

- train.csv - the training set, contains comments with their binary labels
- test.csv - the test set, you must predict the toxicity probabilities for these comments.
- sample_submission.csv - a sample submission file in the correct format
- test_labels.csv - labels for the test data; value of -1 indicates it was not used for scoring; (Note: file added after competition close!)

## Importing packages & dataset

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

np.set_printoptions(precision=4)
sns.set(font_scale=1.5)
plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [4]:
df = pd.read_csv("/Users/anastasiakuzmich/Desktop/jigsaw-toxic-comment-classification-challenge/train.csv")

print("The dataset contains %s entries with %s features." 
      % (df.shape[0], df.shape[1]))

FileNotFoundError: [Errno 2] No such file or directory: '/Users/anastasiakuzmich/Desktop/jigsaw-toxic-comment-classification-challenge/train.csv'

## Exploring the dataset

### General features

In [None]:
df.info()

✏️ Okay, there are no missing values & the data types are fine. Within this dataset there is:

1. An id column which might just be easier to drop. 
2. The comment_text column which contains the content of the comment itself
3. The columns indicating the comment's annotation as toxic, severe_toxic, obscene, threat, insult or identity hate. Let's explore these further:

In [None]:
df.describe()

✏️ The labels are binary, so a comment either has a given label or it doesn't. 9% of the comments are toxic, 5% are obscene, 4% are insults, and less than 1% each are severely toxic, identity hate or a threat. 

Let's find out if these labels are mutually exclusive + how many non-toxic comments there are.

In [None]:
df['total_score'] = (df['toxic'] 
                     + df["severe_toxic"] 
                     + df["obscene"] 
                     + df["threat"]
                     + df["insult"]
                     + df["identity_hate"])

df['total_score'].value_counts(normalize=True)
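The same per-comment count can also be computed by summing across the label columns in one call. A minimal sketch on a made-up four-row frame (the toy values and the `label_cols` name are assumptions for illustration, not taken from the dataset):

```python
import pandas as pd

label_cols = ["toxic", "severe_toxic", "obscene",
              "threat", "insult", "identity_hate"]

# Toy rows standing in for train.csv (values are made up)
toy = pd.DataFrame(
    [[0, 0, 0, 0, 0, 0],
     [1, 0, 1, 0, 1, 0],
     [0, 0, 0, 0, 0, 0],
     [1, 1, 1, 0, 1, 0]],
    columns=label_cols,
)

# Row-wise sum = number of labels attached to each comment
total_score = toy[label_cols].sum(axis=1)

# Share of comments per label count
print(total_score.value_counts(normalize=True).sort_index())
```

Keeping the column names in a list also makes it easy to reuse them later when building the label dictionaries.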

✏️ My takeaways at this stage:

- 90% of the comments aren't negative. That's good to know, but we are dealing with severe class imbalance...
- A good number of comments tick several boxes, so the labels are not exclusive. Some tick all 6 somehow? (*How angry do you have to be?*)
- The model needs to be capable of detecting different types of toxicity like threats, obscenity, insults, and identity-based hate. I need to think about how I'll need to restructure the data and what I want my predictors to be.

**Next questions to address:**

1. Can a comment be both toxic and severely toxic? 
2. Are toxic & severe toxic the umbrella categories, with obscene/threat/insult/identity_hate as their sub-categories? Restructure and see.
3. Perhaps it is worth introducing a non-toxic column at this stage? The model would then first predict toxic vs non-toxic vs severely toxic and, *if* a comment is toxic, which subcategory of toxic it falls into.

### Getting to know the labels

#### Can a comment be both toxic and severely toxic?

In [None]:
print("Number of toxic comments:", len(df[df['toxic'] == 1]))
print("Number of severely toxic comments:", len(df[df['severe_toxic'] == 1]))
print("Number of severely toxic and toxic comments:", len(df[(df['toxic'] == 1) & (df['severe_toxic'] == 1)]))

✏️ Yes. All severely toxic comments are also toxic, but not all toxic comments are severe. Perhaps it's worth making them exclusive within the dataframe? Could also do it later on after modelling to see if it would improve the score. 
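A cross-tabulation gives the same overlap picture in a single table. This is only a sketch on made-up labels (the `toy` frame is an assumption, not the real data):

```python
import pandas as pd

# Toy labels (made up): every severe_toxic row is also toxic
toy = pd.DataFrame({
    "toxic":        [1, 1, 1, 0, 0],
    "severe_toxic": [1, 0, 0, 0, 0],
})

# Rows: toxic, columns: severe_toxic
overlap = pd.crosstab(toy["toxic"], toy["severe_toxic"])
print(overlap)

# severe_toxic is a subset of toxic when no row has toxic == 0 and severe_toxic == 1
is_subset = not ((toy["toxic"] == 0) & (toy["severe_toxic"] == 1)).any()
print("severe_toxic is a subset of toxic:", is_subset)
```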

#### Do the obsene/threat/insult/identity_hate tags come under the toxic umbrella?

In [None]:
print("Number of obscene comments that are not toxic:",
      len(df[(df['toxic'] == 0) & (df['obscene'] == 1)]))

print("Number of threatening comments that are not toxic:",
      len(df[(df['toxic'] == 0) & (df['threat'] == 1)]))

print("Number of insulting comments that are not toxic:",
      len(df[(df['toxic'] == 0) & (df['insult'] == 1)]))

print("Number of identity-hateful comments that are not toxic:",
      len(df[(df['toxic'] == 0) & (df['identity_hate'] == 1)]))

✏️ No, apparently each of these tags can appear on a comment that isn't tagged toxic? Which doesn't make complete sense to me. Let's inspect:

In [None]:
for category in ["obscene", "threat", "insult", "identity_hate"]:
    for i in range(1):
        print("Comment ", i+1, ": ", str(category),
              "\n", 
              df[(df['toxic'] == 0) & (df[category] == 1)]["comment_text"].iloc[i], 
              "\n")

✏️ Ummm... these look toxic to me? But I just re-read the challenge rules and I'm expected to classify each of the columns so I'll try not to overthink why these categories are the way they are. I will make an extra column for non-toxic, neutral comments though because in my head that will make the probabilities make sense?

#### Adding a non-toxic column:

In [None]:
def non_toxic_mapper(x):
    if x == 0: 
        return 1
    else:
        return 0

df["non_toxic"] = df["total_score"].map(non_toxic_mapper)
df["non_toxic"].value_counts()
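The mapper can also be written as a single vectorised expression. A minimal sketch on a toy frame (the rows are made up, and `label_cols` is a name introduced here for illustration):

```python
import pandas as pd

label_cols = ["toxic", "severe_toxic", "obscene",
              "threat", "insult", "identity_hate"]

# Two made-up comments: one with no labels, one with two
toy = pd.DataFrame([[0, 0, 0, 0, 0, 0],
                    [1, 0, 1, 0, 0, 0]], columns=label_cols)

# 1 where no label is set, 0 otherwise
toy["non_toxic"] = toy[label_cols].sum(axis=1).eq(0).astype(int)
print(toy["non_toxic"].tolist())
```

This skips the intermediate total_score column, so there is nothing to drop afterwards.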

#### Removing the total_score column

This column was only used to explore the data, so I'm dropping it before I start the pre-modelling stages. 

In [None]:
df = df.drop("total_score", axis=1)

df.head()

## Cleaning the comment_text column

✏️ Having read up on spaCy, it appears that its pipelines work best on natural sentences because that's what its training data looks like. Extensive preprocessing, like removing stop words and lowercasing, will apparently make things worse with spaCy because it uses that information for clues (source: [Stack Overflow](https://stackoverflow.com/a/70502883)).

That said, I will still be removing newlines and formatting characters:

In [None]:
# Inspecting examples

df["comment_text"].iloc[0]

✏️ Printing these would make the formatting characters invisible, so I ran the below in individual cells...

In [None]:
# df["comment_text"].iloc[1]
# df["comment_text"].iloc[5]
# df["comment_text"].iloc[10]
# df["comment_text"].iloc[20]
# df["comment_text"].iloc[21]

✏️ 

**To remove:**

Having inspected some of the comments, I decided I will be removing the following features from the data:

- IP addresses
- Date & Time (Format: + 21:51, January 11, 2016 (UTC))
- Newline characters (\n) & "\\" separator
- Non-breaking spaces (\xa0)

### IP address

In [None]:
import re

# Raw string so that \d and \. reach the regex engine unchanged
IpAddressRegex = r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"

for i in range(100):
    txt = df["comment_text"].iloc[i]
    x = re.findall(IpAddressRegex, txt)
    if len(x) > 0:
        print(x)

✏️ The regex works, so I'll use it to clean now:

In [None]:
df["comment_text"] = df["comment_text"].replace(IpAddressRegex,'', regex=True)
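To see what the substitution does to an individual comment, here is a small sketch with `re.sub` on made-up strings. The word boundaries (`\b`) are my own addition, an assumption meant to avoid clipping digits out of longer numbers:

```python
import re

# Same four-octet shape as the pattern above, with word boundaries added (an assumption)
ip_pattern = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")

samples = [
    "Vandalism from 192.168.0.1 again",   # made-up comment
    "No address in this one",
]

# Replace each match with an empty string, as in the cell above
cleaned = [ip_pattern.sub("", s) for s in samples]
print(cleaned)
```

Note that the pattern only checks the shape, so anything with four dot-separated number groups (e.g. some version strings) would be removed too.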

### Date Time

In [None]:
# Raw string so the backslash escapes are passed to the regex engine as written
DateTimeRegex = r"(\d{2}\:\d{2}\, [A-Z][a-z]{2,8}\ \d{1,2}\, \d{4}\ \([A-Z]{3}\))|(\d{2}\:\d{2}\, \d{2}\ [A-Z][a-z]{2}\ \d{4}\ \([A-Z]{3}\))"

for i in range(50):
    txt = df["comment_text"].iloc[i]
    x = re.findall(DateTimeRegex, txt)
    if len(x) > 0:
        print(x)

In [None]:
df["comment_text"] = df["comment_text"].replace(DateTimeRegex,'', regex=True)

**Checking for other pattern types:**

In [None]:
df[df['comment_text'].str.contains('Jan')]["comment_text"]

**Patterns to add:**

- 10:36, 5 January 2012
- 27 January 2010 (UTC)
- "Semi-protected edit request on "7 January 2014
- 01:29, 30
- January 2013 (UTC)

✏️ The rest of the dates are mostly mentioned in context. The data should be clean once I add the formats below:

In [None]:
updated_DateTimeRegex = r"(\d{2}\:\d{2}\, [A-Z][a-z]{2,8}\ \d{1,2}\, \d{4}\ \([A-Z]{3}\))|(\d{2}\:\d{2}\, \d{2}\ [A-Z][a-z]{2}\ \d{4}\ \([A-Z]{3}\))|(\d{2}\:\d{2}\, \d{1,2}\ [A-Z][a-z]{2,8} \d{4})|(\d{1,2}\ [A-Z][a-z]{2,8}\ \d{4}\ \([A-Z]{3}\))|(\d{1,2}\ [A-Z][a-z]{1,7}\ \d{4})|(\d{2}\:\d{2}\, \d{1,2})|([A-Z][a-z]{2,8}\ \d{4}\ \([A-Z]{3}\))"
df['comment_text'] = df['comment_text'].replace(updated_DateTimeRegex,'', regex=True)
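Long alternations like this are hard to read, so here is a sketch of how two of the signature formats could be written with `re.VERBOSE`. It is a simplified illustration on made-up strings, not a replacement for the full pattern above:

```python
import re

# A readable sketch of two of the signature formats (not the full pattern)
timestamp = re.compile(r"""
    \d{2}:\d{2},\s                          # 21:51,
    (?:
        [A-Z][a-z]{2,8}\s\d{1,2},\s\d{4}    # January 11, 2016
      | \d{1,2}\s[A-Z][a-z]{2,8}\s\d{4}     # 5 January 2012
    )
    (?:\s\([A-Z]{3}\))?                     # optional (UTC)
""", re.VERBOSE)

samples = [
    "Thanks. 21:51, January 11, 2016 (UTC)",
    "See talk page 10:36, 5 January 2012",
    "Born in January 2013, apparently",      # date in context, left alone
]

cleaned = [timestamp.sub("", s).strip() for s in samples]
print(cleaned)
```

Requiring the leading `HH:MM,` is what keeps dates mentioned in context from being stripped in this sketch.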

### Formatting characters

In [None]:
df['comment_text'] = df["comment_text"].replace('\\n', ' ', regex=True)
df['comment_text'] = df['comment_text'].replace(r'\\', ' ', regex=True)
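# .str.replace works on substrings; Series.replace without regex=True only matches whole values
df['comment_text'] = df['comment_text'].str.replace('\xa0', ' ', regex=False)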
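pandas distinguishes between replacing whole values and replacing substrings, which matters for characters such as `\xa0` that sit inside longer comments. A minimal sketch on a made-up Series:

```python
import pandas as pd

s = pd.Series(["a\xa0b", "\xa0"])

# Without regex=True, Series.replace only swaps values that equal '\xa0' exactly
whole_value = s.replace("\xa0", " ")

# .str.replace works on substrings inside each string
substring = s.str.replace("\xa0", " ", regex=False)

print(whole_value.tolist())
print(substring.tolist())
```

Only the second variant touches the non-breaking space inside `"a\xa0b"`.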

### Writing functions for future use on unseen comments data

In [None]:
def non_toxic_mapper(x):
    
    """Maps the non-toxic column based on the total score column."""
    
    if x == 0: 
        return 1
    else:
        return 0

def add_non_toxic_column(df):
    
    """Adds a non-toxic column."""
    
    df['total_score'] = (df['toxic'] + df["severe_toxic"] 
                         + df["obscene"] + df["threat"]
                         + df["insult"] + df["identity_hate"])

    df["non_toxic"] = df["total_score"].map(non_toxic_mapper)
    df = df.drop("total_score", axis=1)
    
    return df

def clean_comments_column(df):
    
    """Cleans the comments column."""
    
    import re
    
    # Raw strings so the regex escapes are passed through unchanged
    IpAddressRegex = r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
    DateTimeRegex = r"(\d{2}\:\d{2}\, [A-Z][a-z]{2,8}\ \d{1,2}\, \d{4}\ \([A-Z]{3}\))|(\d{2}\:\d{2}\, \d{2}\ [A-Z][a-z]{2}\ \d{4}\ \([A-Z]{3}\))|(\d{2}\:\d{2}\, \d{1,2}\ [A-Z][a-z]{2,8} \d{4})|(\d{1,2}\ [A-Z][a-z]{2,8}\ \d{4}\ \([A-Z]{3}\))|(\d{1,2}\ [A-Z][a-z]{1,7}\ \d{4})|(\d{2}\:\d{2}\, \d{1,2})|([A-Z][a-z]{2,8}\ \d{4}\ \([A-Z]{3}\))"
    
    features_to_remove = [IpAddressRegex,
                          DateTimeRegex,
                          '\\n', r'\\']
    
    for feature in features_to_remove:
        df["comment_text"] = df["comment_text"].replace(feature,' ', regex=True)
 
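    # Substring replacement; Series.replace without regex=True would only match whole values
    df['comment_text'] = df['comment_text'].str.replace('\xa0', ' ', regex=False)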
    
    return df

## Modeling Trial 1: Basic spaCy Multi-Label Classification 

In [None]:
df.columns

In [None]:
y = df[['toxic', 'severe_toxic', 'obscene', 'threat',
       'insult', 'identity_hate', 'non_toxic']]
y.head()

In [None]:
labels = list(y.columns)
y = y.to_dict("index")

dataset = list(zip(df["comment_text"], 
               [{'cats':cats} for cats in y.values()]))

print(dataset[0:2])
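For reference, this is the shape each training item ends up in: a `(text, {"cats": {...}})` pair. A small sketch on a made-up frame with a shortened label set (the texts, values and `label_cols` are assumptions for illustration), using `to_dict("records")` as one alternative way to get a dict per row:

```python
import pandas as pd

toy = pd.DataFrame({
    "comment_text": ["Thanks for the edit", "made-up rude comment"],
    "toxic": [0, 1],
    "insult": [0, 1],
    "non_toxic": [1, 0],
})
label_cols = ["toxic", "insult", "non_toxic"]

# One {label: value} dict per comment
cats = toy[label_cols].to_dict("records")

# Pair each text with its annotations in the {"cats": ...} layout
dataset = [(text, {"cats": c}) for text, c in zip(toy["comment_text"], cats)]
print(dataset[0])
```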

In [None]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(dataset, 
                                         train_size=0.8, 
                                         random_state=13)

In [None]:
import spacy
nlp = spacy.blank("en")

from spacy.pipeline.textcat_multilabel import DEFAULT_MULTI_TEXTCAT_MODEL

config = {
   "threshold": 0.5,
   "model": DEFAULT_MULTI_TEXTCAT_MODEL,
}

textcat = nlp.add_pipe("textcat_multilabel", config=config)

In [None]:
for label in labels:
    textcat.add_label(label)
    
textcat.labels

In [5]:
# spaCy v3 name for begin_training(); it also returns the optimizer
optimizer = nlp.initialize()
iterations = 2

NameError: name 'nlp' is not defined

In [None]:
from spacy.util import minibatch, compounding
from spacy.training import Example

with nlp.select_pipes(enable="textcat_multilabel"):
    for j in range(iterations):
        losses = {}
        k = 0
        batches = minibatch(train_data, size = compounding(4.,32.,1.001))
        for batch in batches:
            text, annotations = zip(*batch)
            example = []
            for i in range(len(text)):
                doc = nlp.make_doc(text[i])
                example.append(Example.from_dict(doc, annotations[i]))
            nlp.update(example, sgd=optimizer, drop=0.2, losses = losses)
            print('Batch No: {} Loss = {}'.format(k, round(losses['textcat_multilabel'])))
            k += 1
        print("\n\n Completed Iterations : {} ".format(j))

In [34]:
train_data[0:5]

[("Okay so no one's gonna address this? Guess that no one editing this page cares about accuracy.",
  {'cats': {'toxic': 0,
    'severe_toxic': 0,
    'obscene': 0,
    'threat': 0,
    'insult': 0,
    'identity_hate': 0,
    'non_toxic': 1}}),
 ('"   In regards to wishful thinking   ""Due to Leuchter\'s ignorance of the large disparity between the amounts of cyanide necessary to kill humans and lice, instead of disproving the homicidal use of gas chambers, the small amounts of cyanide which Leuchter detected actually tended to confirm it.[8]""  It doesn\'t explain how it tended to confirm it, also this whole sentence is presented as fact like everything else in this biased article. Thus I\'m assuming it\'s wishful thinking and only here to discredit Leutcher.   "',
  {'cats': {'toxic': 0,
    'severe_toxic': 0,
    'obscene': 0,
    'threat': 0,
    'insult': 0,
    'identity_hate': 0,
    'non_toxic': 1}}),
 ("Ok, so put the pictures somewhere in the article. Beirut cannot be the ma

In [33]:
train_comments, test_comments = train_test_split(df["comment_text"],
                                                 train_size=0.8,
                                                 random_state=13)

In [41]:
# Tokenize the data

texts = list(train_comments)
docs = [nlp.tokenizer(text) for text in texts]

In [1]:
# Use textcat to get the scores for each doc

textcat = nlp.get_pipe('textcat_multilabel')
# In spaCy v3, predict() returns only the scores array
scores = textcat.predict(docs)
print(scores[0:5])

# The kernel died ...

NameError: name 'nlp' is not defined
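Scoring the whole training set in one `predict` call means holding every Doc and one huge batch in memory at once, which may be what exhausted the kernel. A hedged sketch of a lighter approach: stream the texts through `nlp.pipe` in batches and read the scores from `doc.cats`. The two-label toy data and the tiny untrained pipeline below are assumptions so that the example runs on its own:

```python
import spacy
from spacy.training import Example

labels = ["toxic", "non_toxic"]   # shortened label set for the sketch
toy_train = [
    ("thank you for the helpful edit", {"cats": {"toxic": 0, "non_toxic": 1}}),
    ("made-up rude comment", {"cats": {"toxic": 1, "non_toxic": 0}}),
]

# Stand-in pipeline; in the notebook this would be the trained nlp object
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat_multilabel")
for label in labels:
    textcat.add_label(label)

examples = [Example.from_dict(nlp.make_doc(t), a) for t, a in toy_train]
nlp.initialize(lambda: examples)

# Stream texts through the pipeline in batches; each Doc carries its scores in doc.cats
texts = [t for t, _ in toy_train]
scores = [doc.cats for doc in nlp.pipe(texts, batch_size=64)]
print(scores[0])
```

The `batch_size` value is arbitrary here; a smaller one trades speed for lower peak memory.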

In [28]:
nlp.to_disk('models/')

In [55]:
np.save('train.npy', train_data, allow_pickle=True)
np.save('test.npy', test_data, allow_pickle=True)

## Results & predictions

In [22]:
import spacy

# In a fresh kernel nlp does not exist yet, so load the saved pipeline directly
nlp = spacy.load('models/')

In [23]:
nlp

<spacy.lang.en.English at 0x7f91ddb0d670>

In [11]:
import numpy as np

train_data = np.load("train.npy", allow_pickle=True)
train_data = train_data.tolist()

test_data = np.load("test.npy", allow_pickle=True)
test_data = test_data.tolist()

In [12]:
processed = my_model(train_data)

TypeError: Argument 'string' has incorrect type (expected str, got list)
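The `TypeError` is consistent with a spaCy pipeline receiving a whole list where it expects a single string. A minimal sketch of one way around it, assuming `train_data` holds `[text, annotations]` pairs as saved earlier, and using a blank pipeline as a stand-in for the loaded model:

```python
import spacy

nlp = spacy.blank("en")   # stand-in; the trained pipeline would be loaded from disk instead

# np.load(...).tolist() yields [text, annotations] pairs (toy values here)
train_data = [
    ["first made-up comment", {"cats": {"toxic": 0}}],
    ["second made-up comment", {"cats": {"toxic": 0}}],
]

# Pull out the raw strings, then let nlp.pipe iterate over them
texts = [text for text, _ in train_data]
processed = list(nlp.pipe(texts))

print(len(processed), processed[0].text)
```

With the trained pipeline in place of the blank one, each Doc in `processed` would carry its predicted scores in `doc.cats`.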