## Reading and Interpreting the Dataset

This section uses pandas to load 'labeled_and_scored_comments.csv', the file downloaded for the Data Bias assignment, into a DataFrame. The dataset consists of Wikipedia comments made available by Jigsaw, the Google subsidiary that created the Perspective tool. It includes a unique comment id, the text of the comment, and a series of binary labels applied by human raters: "toxic," "severe_toxic," "obscene," "threat," "insult," and "identity_hate." The "score" column, appended by Professor Engler, holds the toxicity score assigned to the comment text by the live version of the Perspective API. The data is available under a CC0 license.

In [15]:
import pandas as pd
import time

df = pd.read_csv('labeled_and_scored_comments.csv')

In [16]:
df.sort_values(['score'])

Unnamed: 0,id,comment_text,score,toxic,severe_toxic,obscene,threat,insult,identity_hate
1158,61a85266de70cca0,Можешь говорить по русски.,0.000004,0,0,0,0,0,0
5246,030a9ba0b32b4456,Rodzina Krusenstern Potomkowie Admirała Adama ...,0.000009,0,0,0,0,0,0
11040,a6814193ac6b4705,The opening paragraph states that the game is ...,0.000091,0,0,0,0,0,0
24192,f0d6dee96e057a5b,The book is used as a source in a number of ar...,0.000116,0,0,0,0,0,0
2332,cbb9268d3c7429c1,Vals Vienés \n |Rumba \n |Cha Cha Chá(ciclo 7),0.000131,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...
19994,1429a2c930688317,you fucking piece of shit,0.997278,1,1,1,0,1,0
28728,d82e45eb103ad07b,fuck off you stupid aspy asshole,0.997736,1,1,1,0,1,0
25535,e9b92c6c6b01aeef,FUCK YOU YOU PIECE OF SHIT \n FUCK YOU YOU PIE...,0.997982,1,1,1,0,1,0
25945,e783fd267f3a9d3b,FUCK WIKIPEDIA ON WHEELS! \n\nFuck off wikiped...,0.998136,1,1,1,0,1,0


#### Describe the Dataset

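The summary below reports a count of 41338 for the score and for each label column, so the dataset contains n = 41338 comments with no missing values in those columns. That is a sufficiently large sample for this exploration.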

In [17]:
df.describe()

Unnamed: 0,score,toxic,severe_toxic,obscene,threat,insult,identity_hate
count,41338.0,41338.0,41338.0,41338.0,41338.0,41338.0,41338.0
mean,0.244467,0.095384,0.009168,0.05305,0.003024,0.049809,0.009725
std,0.257221,0.293749,0.095313,0.224137,0.054907,0.217553,0.098134
min,4e-06,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.074772,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.128969,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.310894,0.0,0.0,0.0,0.0,0.0,0.0
max,0.998329,1.0,1.0,1.0,1.0,1.0,1.0
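
Because the six label columns are binary, each column's mean in the summary above is the fraction of comments carrying that label, so multiplying by the count recovers roughly how many comments each label applies to. The sketch below does that arithmetic with values copied from the `describe()` output. The rounded counts are estimates reconstructed from the printed means, not values read from the file.

```python
# Values copied from the df.describe() output above.
n = 41338
label_means = {
    "toxic": 0.095384,
    "severe_toxic": 0.009168,
    "obscene": 0.05305,
    "threat": 0.003024,
    "insult": 0.049809,
    "identity_hate": 0.009725,
}

# For a 0/1 column, mean * count is the number of rows equal to 1.
estimated_counts = {label: round(mean * n) for label, mean in label_means.items()}

for label, count in estimated_counts.items():
    print(f"{label}: ~{count} comments ({label_means[label]:.1%})")
```

By this estimate about 3943 of the 41338 comments (roughly 9.5%) were labeled "toxic" by the raters, while only about 125 were labeled "threat", so the labels are heavily imbalanced toward non-toxic comments.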


In [14]:
pip install --upgrade google-api-python-client

Note: you may need to restart the kernel to use updated packages.


In [2]:
from googleapiclient.discovery import build
import json

def get_toxicity_score(comment):
    """Return the Perspective API TOXICITY summary score for a comment string."""
    API_KEY = 'YOUR_API_KEY'  # Put your API key here; do not publish a real key

    # Build a client for the Comment Analyzer (Perspective) API.
    client = build(
        "commentanalyzer",
        "v1alpha1",
        developerKey=API_KEY,
        discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
        static_discovery=False,
    )

    # Request only the TOXICITY attribute for the given text.
    analyze_request = {
        'comment': {'text': comment},
        'requestedAttributes': {'TOXICITY': {}},
    }

    response = client.comments().analyze(body=analyze_request).execute()
    toxicity_score = response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

    return toxicity_score
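
When scoring more than a handful of comments, a small wrapper can pause between calls and keep going when a single request fails. The following is a minimal sketch and not part of the original notebook: `score_comments`, its `delay_seconds` parameter, and the one-second default are hypothetical, and the scoring function is passed in as an argument so the loop can be tried with a stand-in before spending API quota.

```python
import time

def score_comments(comments, score_fn, delay_seconds=1.0):
    """Score each comment with score_fn, pausing between calls.

    Hypothetical helper: score_fn is any callable that maps a string to a
    score, for example get_toxicity_score. Failed calls are recorded as None.
    """
    results = {}
    for i, comment in enumerate(comments):
        if i > 0:
            time.sleep(delay_seconds)  # assumed pause to stay under the request quota
        try:
            results[comment] = score_fn(comment)
        except Exception as exc:  # e.g. a network or quota error
            print(f"Could not score {comment!r}: {exc}")
            results[comment] = None
    return results

# Stand-in scorer for a dry run; it does not call the Perspective API.
def fake_score(text):
    return 0.5

demo = score_comments(["first comment", "second comment"], fake_score, delay_seconds=0)
```

With a valid key in place, `score_comments(texts, get_toxicity_score)` would call the API once per comment in `texts`.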

## False Positives

In context, the following comments are not posed in a toxic or negative manner, yet the toxicity scores calculated for them are high.

In [19]:
get_toxicity_score("i'm so hungry i could eat a pig")

0.8060606

In [20]:
get_toxicity_score("hey bitch i love you")

0.93232936

In [47]:
get_toxicity_score("she's serving cunt")

0.94992846

### Finding an example of a comment with a high score that isn't necessarily presented in a toxic manner

In [36]:
df[df['comment_text'].str.contains("bitch")].sort_values(['score'])

Unnamed: 0,id,comment_text,score,toxic,severe_toxic,obscene,threat,insult,identity_hate
2893,550737d60a327c21,""" \n\n == This, that and the other thing.........",0.258043,0,0,0,0,0,0
39803,9cbd4a5b86d55e6f,"""\n\nHey, I wasn't intending to be rude but ir...",0.273786,0,0,0,0,0,0
32776,c2c3f3e7b4841bf1,(Career Change) \n\nThe latest Popbitch mailou...,0.357251,0,0,0,0,0,0
19078,229cc6cdff7035e4,"""For those who want to bitch about me any beha...",0.464582,0,0,0,0,0,0
19877,161db0d904c2f062,Tony Sidaway\nThe article Jim Nussle has been ...,0.473755,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...
15219,613f3490153d346e,Fuck you \n\nYou're a stupid bitch 142.162.85.154,0.994489,1,0,1,0,1,0
21751,fd6e58a896118b31,fuck off you son of a bitch!!,0.994951,1,1,1,0,1,0
23096,f68d9ef333909f03,Fucking whore \n\nYou are such a dumb bitch i ...,0.995611,1,0,1,0,1,0
18792,26fed2949c2dadc1,you son of a bitch RIP you motherfucker,0.996669,1,0,1,0,1,0
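
A keyword search is one way in. Another is to filter directly for rows where the Perspective score is high but the human raters did not apply the "toxic" label. The sketch below assumes a cutoff of 0.8, which is an arbitrary choice rather than an official threshold, and runs on a small made-up frame with the same column names so that it stands alone. `find_high_score_unlabeled` is a hypothetical helper name.

```python
import pandas as pd

def find_high_score_unlabeled(frame, threshold=0.8):
    """Return rows scored at or above threshold that raters did not mark toxic."""
    mask = (frame["score"] >= threshold) & (frame["toxic"] == 0)
    return frame[mask].sort_values("score", ascending=False)

# Made-up rows for illustration only; these are not taken from the dataset.
sample = pd.DataFrame(
    {
        "comment_text": ["example a", "example b", "example c", "example d"],
        "score": [0.91, 0.85, 0.12, 0.97],
        "toxic": [0, 1, 0, 0],
    }
)

candidates = find_high_score_unlabeled(sample)
print(candidates)
```

On the notebook's data the equivalent call would be `find_high_score_unlabeled(df)`. The rows it returns are only candidates, since a high score paired with a 0 label can also mean the raters missed something.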


In [27]:
print(df.iloc[40381]['comment_text'])

"==BLACK SLANG==
This article suggests that using ""bitch"" to describe a young woman is exclusive to hip hop culture. This is not the case. This is THE prominent term to describe a woman in the African-American community. How many black people do you know who actually call a woman a woman? None. They all say ""bitch"" as in ""Ah fucked dis bitch las' naht nigga. She wa' FAAAHN."" Black people will NEVER say ""woman"" because it is ingrained in their culture to be derogatory towards women! THAT is why it is used in hip hop, because that is black entertainment. Misogyny is a crucial element of African0American culture and this is why black people throw around the word ""bitch"" as a synonym for ""woman"". It's the same way how black people never call a song a ""song"" they will always call it a ""track"" and if you call it a song the black people will think you're a ""square"" ad ""not hip."" That's how they are and it should be in the article.

"


In [46]:
print(df.iloc[40381]['score'])

0.825715
