# BERT + Jigsaw: toxicity detection in the TED Talks (IWSLT) datasets

#### Cameron Clarke, 04/29/2021

## Load data

In [7]:
import numpy as np
from scipy.special import softmax

In [2]:
df_2014 = np.loadtxt('/home/ccc779/ted_talks_iwslt-0-train-16:17:28-eval.out')

In [3]:
df_2015 = np.loadtxt('/home/ccc779/ted_talks_iwslt-1-train-16:17:39-eval.out') 

In [4]:
df = np.vstack((df_2014, df_2015))

## Classification: positive class probabilities

#### Get class probabilities

In [8]:
# Row-wise softmax over the model outputs; column 1 is the positive ("toxic") class
pos_scores = softmax(df, axis=1)[:, 1]

In [10]:
np.sum(pos_scores > 0.5)

47

#### Average score

In [14]:
np.average(pos_scores)

0.00760180195725067

#### Score variance

In [15]:
np.var(pos_scores)

0.006292605633250623

#### Score quantiles (deciles)

In [33]:
pos_score_deciles = {
    round(prop, 1): np.quantile(pos_scores, q=prop)
    for prop in np.arange(start=0.1, stop=1, step=0.1)
}

In [34]:
pos_score_deciles

{0.1: 2.6312316414365275e-05,
 0.2: 2.7442732690052506e-05,
 0.3: 2.8732080026177693e-05,
 0.4: 3.0296154336716905e-05,
 0.5: 3.313242519161377e-05,
 0.6: 3.943482024661601e-05,
 0.7: 6.070303201986001e-05,
 0.8: 0.00014524627277485058,
 0.9: 0.0008943889191192632}
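
The dictionary comprehension above calls `np.quantile` once per decile. As a minimal sketch of an equivalent, vectorized form, the block below passes all nine proportions in a single call. `toy_scores` is a made-up stand-in for `pos_scores`, not the notebook's data.

```python
import numpy as np

# Hypothetical stand-in for pos_scores: a right-skewed set of values in [0, 1].
toy_scores = np.linspace(0.0, 1.0, 101) ** 4

# np.quantile accepts an array of proportions and returns one value per entry.
props = np.arange(1, 10) / 10
toy_deciles = dict(zip(props.tolist(), np.quantile(toy_scores, props).tolist()))

for prop, value in toy_deciles.items():
    print(prop, value)
```

Building the proportions as `np.arange(1, 10) / 10` also sidesteps the floating-point drift that `np.arange(0.1, 1, 0.1)` can accumulate, so the keys need no rounding.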

## Classification: class assignments

#### Get 1/0 class assignments

In [11]:
class_assigns = np.argmax(df, axis=1)

In [12]:
class_assigns

array([0, 0, 0, ..., 0, 0, 0])

In [13]:
np.sum(class_assigns)

47
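
This count matches the number of examples with `pos_scores > 0.5`. That is expected for a two-class model: the softmax probability of class 1 equals a sigmoid of the logit difference, so it exceeds 0.5 exactly when the class-1 logit is the larger one. A minimal sketch on made-up logits (`toy_logits` is hypothetical, and the two-column layout is an assumption about the eval output):

```python
import numpy as np

# Hypothetical two-class logits: column 0 = non-toxic, column 1 = toxic.
toy_logits = np.array([
    [4.2, -3.9],
    [-0.3, 0.8],
    [1.5, 1.4],
    [-2.0, 2.5],
])

# For two classes, softmax(z)[1] == 1 / (1 + exp(-(z1 - z0))).
diff = toy_logits[:, 1] - toy_logits[:, 0]
toy_pos_scores = 1.0 / (1.0 + np.exp(-diff))

by_threshold = toy_pos_scores > 0.5
by_argmax = np.argmax(toy_logits, axis=1) == 1

# Both rules flag the same rows.
print(np.array_equal(by_threshold, by_argmax))
```

On an exact tie both rules also agree: the probability is 0.5, which is not above the threshold, and `np.argmax` returns the first index, 0.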

#### Average of class assignments

In [26]:
np.average(class_assigns)

0.007213014119091467

#### Variance of class assignments


In [35]:
np.var(class_assigns)

0.007160986546409252
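
Because `class_assigns` contains only 0s and 1s, its variance is fully determined by its mean `p` as `p * (1 - p)`, so this cell carries no information beyond the average. The sketch below checks that identity against the two values reported above. The implied row count is an inference from those outputs, not a number the notebook prints.

```python
import numpy as np

# Values copied from the notebook outputs above.
p = 0.007213014119091467             # np.average(class_assigns)
reported_var = 0.007160986546409252  # np.var(class_assigns)

# Variance of a 0/1 indicator with mean p.
indicator_var = p * (1 - p)
print(np.isclose(indicator_var, reported_var))

# With 47 positives, the mean implies roughly this many rows in total.
implied_rows = 47 / p
print(round(implied_rows))
```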

## The positive ("toxic") examples

In [36]:
import datasets

In [45]:
nl_en_2014_dataset = datasets.load_dataset('ted_talks_iwslt', 'nl_en_2014')

Reusing dataset ted_talks_iwslt (/home/ccc779/.cache/huggingface/datasets/ted_talks_iwslt/nl_en_2014/1.1.0/caf519a0a183db297ca5f39dbfd42de3a415aaa79b5a638edd4fd7a3e3b0e545)


In [43]:
nl_en_2015_dataset = datasets.load_dataset('ted_talks_iwslt', 'nl_en_2015')

Downloading and preparing dataset ted_talks_iwslt/nl_en_2015 (download: 1.55 GiB, generated: 1.23 MiB, post-processed: Unknown size, total: 1.55 GiB) to /home/ccc779/.cache/huggingface/datasets/ted_talks_iwslt/nl_en_2015/1.1.0/caf519a0a183db297ca5f39dbfd42de3a415aaa79b5a638edd4fd7a3e3b0e545...



Dataset ted_talks_iwslt downloaded and prepared to /home/ccc779/.cache/huggingface/datasets/ted_talks_iwslt/nl_en_2015/1.1.0/caf519a0a183db297ca5f39dbfd42de3a415aaa79b5a638edd4fd7a3e3b0e545. Subsequent calls will reuse this data.


In [51]:
num_rows_2014 = len(nl_en_2014_dataset['train'])

In [52]:
num_rows_2015 = len(nl_en_2015_dataset['train'])

In [65]:
pos_inds_2014 = [i for i in range(num_rows_2014) if class_assigns[i] == 1]

In [57]:
# The 2015 rows follow the 2014 rows in the stacked array, so offset by num_rows_2014
pos_inds_2015 = [i for i in range(num_rows_2015) if class_assigns[i + num_rows_2014] == 1]
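
The two comprehensions above map positions in the stacked array back to per-dataset row indices. The same bookkeeping can be sketched with `np.flatnonzero`. `split_positive_indices` and `toy_assigns` are hypothetical names used for illustration, and the sketch assumes, as the notebook does, that the stacked rows are in the same order as the dataset rows.

```python
import numpy as np

def split_positive_indices(assigns, n_first):
    """Split positions of 1s in a stacked array into per-block local indices."""
    pos = np.flatnonzero(np.asarray(assigns) == 1)
    first = pos[pos < n_first]               # indices local to the first block
    second = pos[pos >= n_first] - n_first   # shift back to the second block's own indexing
    return first.tolist(), second.tolist()

# Toy example: a first block of 4 rows followed by a second block of 3 rows.
toy_assigns = np.array([0, 1, 0, 0, 1, 1, 0])
first_inds, second_inds = split_positive_indices(toy_assigns, n_first=4)
print(first_inds, second_inds)
```

A quick guard worth adding in the real notebook is a check that `len(class_assigns) == num_rows_2014 + num_rows_2015`, since a mismatch would silently misalign the indices.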

#### Examples flagged as toxic in the 2014 dataset

In [66]:
for i in pos_inds_2014:
    # Index the row first so the whole 'translation' column is not re-read on every iteration
    print(i, ':', nl_en_2014_dataset['train'][i]['translation']['en'])

107 : Grégoire Courtine: The paralyzed rat that walked
179 : William Li: Can we eat to starve cancer?
290 : <i>Freakonomics</i> author Steven Levitt presents new data on the finances of drug dealing. Contrary to popular myth, he says, being a street-corner crack dealer isn’t lucrative: It pays below minimum wage. And your boss can kill you.
309 : Rose George: Let's talk crap. Seriously.
594 : What is killing the Tasmanian devil? A virulent cancer is infecting them by the thousands  -- and unlike most cancers, it's contagious. Researcher Elizabeth Murchison tells us how she's fighting to save the Taz, and what she's learning about all cancers from this unusual strain. Contains disturbing images of facial cancer.
595 : Elizabeth Murchison: Fighting a contagious cancer
975 : David Deutsch: Chemical scum that dream of distant quasars
1071 : Bart Knols: Cheese, dogs and a pill to kill mosquitoes and end malaria
1654 : Investor and prankster Yossi Vardi delivers a ballsy lecture on the dange

#### Examples flagged as toxic in the 2015 dataset

In [64]:
for i in pos_inds_2015:
    # Index the row first so the whole 'translation' column is not re-read on every iteration
    print(i, ':', nl_en_2015_dataset['train'][i]['translation']['en'])

31 : Bel Pesce: 5 ways to kill your dreams
227 : Meaghan Ramsey: Why thinking you're ugly is bad for you
257 : Zak Ebrahim: I am the son of a terrorist. Here's how I chose peace.
321 : Lorrie Faith Cranor: What’s wrong with your pa$$w0rd?
485 : Shereen El Feki: A little-told tale of sex and sensuality
569 : Grégoire Courtine: The paralyzed rat that walked
827 : John McWhorter: Txtng is killing language. JK!!!
837 : Rose George: Let's talk crap. Seriously.
883 : David Anderson: Your brain is more than a bag of chemicals
941 : iO Tillett Wright: Fifty shades of gay
1019 : Ernesto Sirolli: Want to help someone? Shut up and listen!
1021 : Candy Chang: Before I die I want to...
1066 : What does a disgusting image have to do with how you vote? Equipped with surveys and experiments, psychologist David Pizarro demonstrates a correlation between sensitivity to disgusting cues -- a photo of feces, an unpleasant odor -- and moral and political conservatism. <em>(Filmed at TEDxEast.)</em>
1281 : D