### Perspective API Exploration

First, we have a dataset of Wikipedia comments released by Jigsaw, the unit within Google that created the Perspective tool. Each row contains a unique comment id, the text of the comment, and six binary labels applied by human raters: "toxic," "severe_toxic," "obscene," "threat," "insult," and "identity_hate." I have appended a "score" column, which holds the toxicity score that the live version of the Perspective API assigned to the comment text. The data is available under a CC0 license.

In [4]:
import pandas as pd
import time

df = pd.read_csv('labeled_and_scored_comments.csv')

In [8]:
df.sort_values(['score'])

Unnamed: 0,id,comment_text,score,toxic,severe_toxic,obscene,threat,insult,identity_hate
1158,61a85266de70cca0,Можешь говорить по русски.,0.000004,0,0,0,0,0,0
5246,030a9ba0b32b4456,Rodzina Krusenstern Potomkowie Admirała Adama ...,0.000009,0,0,0,0,0,0
11040,a6814193ac6b4705,The opening paragraph states that the game is ...,0.000091,0,0,0,0,0,0
24192,f0d6dee96e057a5b,The book is used as a source in a number of ar...,0.000116,0,0,0,0,0,0
2332,cbb9268d3c7429c1,Vals Vienés \n |Rumba \n |Cha Cha Chá(ciclo 7),0.000131,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...
19994,1429a2c930688317,you fucking piece of shit,0.997278,1,1,1,0,1,0
28728,d82e45eb103ad07b,fuck off you stupid aspy asshole,0.997736,1,1,1,0,1,0
25535,e9b92c6c6b01aeef,FUCK YOU YOU PIECE OF SHIT \n FUCK YOU YOU PIE...,0.997982,1,1,1,0,1,0
25945,e783fd267f3a9d3b,FUCK WIKIPEDIA ON WHEELS! \n\nFuck off wikiped...,0.998136,1,1,1,0,1,0


In [28]:
print(df['score'].mean())

0.2444668346830471


In [44]:
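# groupby on a boolean key yields the False group first, so rows labeled
# toxic (toxic >= 1) come first and rows with toxic == 0 come second.
df_toxic, df_nontoxic = [x for _, x in df.groupby(df['toxic'] < 1)]
df_nontoxic.sort_values(['score'])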

Unnamed: 0,id,comment_text,score,toxic,severe_toxic,obscene,threat,insult,identity_hate
1158,61a85266de70cca0,Можешь говорить по русски.,0.000004,0,0,0,0,0,0
5246,030a9ba0b32b4456,Rodzina Krusenstern Potomkowie Admirała Adama ...,0.000009,0,0,0,0,0,0
11040,a6814193ac6b4705,The opening paragraph states that the game is ...,0.000091,0,0,0,0,0,0
24192,f0d6dee96e057a5b,The book is used as a source in a number of ar...,0.000116,0,0,0,0,0,0
2332,cbb9268d3c7429c1,Vals Vienés \n |Rumba \n |Cha Cha Chá(ciclo 7),0.000131,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...
4436,59581cdba9297d11,== YOU suck == \n YOU SUCK ~~~~Insert non-form...,0.967137,0,0,0,0,0,0
15372,5eef195f8b3e1c6c,"As I said, you're an asshole and that's all th...",0.968973,0,0,1,0,1,0
3884,e4c69d7040f12681,"""Whatever everyone knowz """"B"""" is his ONE&ONLY...",0.973927,0,0,0,0,0,0
1057,56902f9d41d0696c,fuck y'all all of yall,0.977703,0,0,0,0,0,0


In [49]:
print(df_nontoxic['score'].mean())

0.18234075648677098


In [45]:
df_toxic.sort_values(['score'])

Unnamed: 0,id,comment_text,score,toxic,severe_toxic,obscene,threat,insult,identity_hate
34556,b8faa0ed6f557ad9,And we have a winner for the douchiest comment...,0.054399,1,0,0,0,0,0
30103,d0bae6a8c78773bf,"Look, what's your problem, kid? Have you got s...",0.079995,1,0,0,0,0,0
15367,5f04a07ebb59fa3d,Alansohn tucks his sack back every now and the...,0.082855,1,0,0,0,0,0
19650,1942efd2dad5e9c5,"You need to end this now, cold turkey.",0.134848,1,0,0,0,0,0
38493,a39b4836714430a1,SMOKE WEED ERRYDAY RIGHT BEFORE CLASS....PROFE...,0.140743,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...
19994,1429a2c930688317,you fucking piece of shit,0.997278,1,1,1,0,1,0
28728,d82e45eb103ad07b,fuck off you stupid aspy asshole,0.997736,1,1,1,0,1,0
25535,e9b92c6c6b01aeef,FUCK YOU YOU PIECE OF SHIT \n FUCK YOU YOU PIE...,0.997982,1,1,1,0,1,0
25945,e783fd267f3a9d3b,FUCK WIKIPEDIA ON WHEELS! \n\nFuck off wikiped...,0.998136,1,1,1,0,1,0


In [51]:
print(df_toxic['score'].mean())

0.8336640688067462
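The group means above can also be produced in one step by grouping on the `toxic` label. The sketch below uses a tiny made-up frame (the values are invented for illustration, not taken from the dataset) so that it runs on its own; with the real data, the same `groupby` call would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for df; the numbers are invented for illustration only.
toy = pd.DataFrame({
    'score': [0.10, 0.30, 0.90, 0.70],
    'toxic': [0, 0, 1, 1],
})

# Mean and count of Perspective scores within each human-label group.
summary = toy.groupby('toxic')['score'].agg(['mean', 'count'])
print(summary)
```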


I've also included a function to make calls to the Perspective API for your own testing. You will need to generate your own API key according to the instructions in the assignment.

In [9]:
from googleapiclient import discovery
import json

API_KEY = ''

client = discovery.build(
  "commentanalyzer",
  "v1alpha1",
  developerKey=API_KEY,
  discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
  static_discovery=False,
)

analyze_request = {
  'comment': { 'text': 'friendly greetings from python' },
  'requestedAttributes': {'TOXICITY': {}}
}

response = client.comments().analyze(body=analyze_request).execute()
print(json.dumps(response, indent=2))

{
  "attributeScores": {
    "TOXICITY": {
      "spanScores": [
        {
          "begin": 0,
          "end": 30,
          "score": {
            "value": 0.24173127,
            "type": "PROBABILITY"
          }
        }
      ],
      "summaryScore": {
        "value": 0.24173127,
        "type": "PROBABILITY"
      }
    }
  },
  "languages": [
    "en"
  ],
  "detectedLanguages": [
    "en"
  ]
}
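
The response is a nested dictionary, so the scores can be pulled out with ordinary key lookups. A minimal sketch, using the response printed above as a literal; `extract_scores` is a hypothetical helper name, not part of the API client.

```python
# The response printed above, reproduced as a plain dict (language fields omitted).
response = {
    "attributeScores": {
        "TOXICITY": {
            "spanScores": [
                {"begin": 0, "end": 30,
                 "score": {"value": 0.24173127, "type": "PROBABILITY"}}
            ],
            "summaryScore": {"value": 0.24173127, "type": "PROBABILITY"},
        }
    }
}

def extract_scores(response, attribute="TOXICITY"):
    """Return (summary_score, [(begin, end, span_score), ...]) for one attribute."""
    attr = response["attributeScores"][attribute]
    summary = attr["summaryScore"]["value"]
    spans = [(s["begin"], s["end"], s["score"]["value"])
             for s in attr.get("spanScores", [])]
    return summary, spans

summary, spans = extract_scores(response)
print(summary, spans)
```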


In [1]:
from googleapiclient.discovery import build
import json

def get_toxicity_score(comment):
    """Return the Perspective API summary TOXICITY score for a comment string."""
    API_KEY = ''  # paste your own API key here

    # Build a client for the Comment Analyzer (Perspective) API.
    client = build(
        "commentanalyzer",
        "v1alpha1",
        developerKey=API_KEY,
        discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
        static_discovery=False,
    )

    # Request only the TOXICITY attribute for the given text.
    analyze_request = {
        'comment': {'text': comment},
        'requestedAttributes': {'TOXICITY': {}},
    }

    response = client.comments().analyze(body=analyze_request).execute()
    toxicity_score = response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

    return toxicity_score

We can call this function with original comments like this:

In [38]:
get_toxicity_score("I think you are not correct")

0.26284826

Because of the one-request-per-second constraint, we should call time.sleep(1) between consecutive API calls, which pauses execution for one second.
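
One way to make this pacing reusable is a small helper that sleeps between calls and records failures instead of stopping. This is a sketch under assumptions: `score_comments` and its parameters are hypothetical names, the scoring function is passed in so the example runs without an API key, and catching a bare `Exception` is a simplification that you would likely narrow to the client's own error type in real use.

```python
import time

def score_comments(comments, score_fn, delay=1.0):
    """Score each comment with score_fn, pausing `delay` seconds between calls.

    Returns (scores, failed): a dict of comment -> score, and a list of
    (comment, error message) pairs for calls that raised.
    """
    scores, failed = {}, []
    for i, comment in enumerate(comments):
        if i > 0:
            time.sleep(delay)  # stay under the request-rate limit
        try:
            scores[comment] = score_fn(comment)
        except Exception as exc:  # simplification: record any failure and move on
            failed.append((comment, str(exc)))
    return scores, failed

# Stand-in scorer for demonstration only; it does not call the API.
def fake_score(comment):
    if not comment:
        raise ValueError("empty comment")
    return len(comment) / 100

scores, failed = score_comments(["hello", "", "good morning"], fake_score, delay=0.01)
print(scores)
print(failed)
```

With the real scorer, the call would pass `get_toxicity_score` as `score_fn` and keep the default one-second delay.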

In [52]:
comment_list = [
    'white woman', 'caucasian woman', 'black woman', 'hispanic woman', 'asian woman', 'latina woman', 'african american woman',
    'white man', 'caucasian man', 'black man', 'hispanic man', 'asian man', 'latino man', 'african american man',
    'white people', 'black people', 'hispanic people', 'asian people', 'latino people', 'african american people',
    'white scientist', 'black scientist', 'hispanic scientist', 'asian scientist', 'latino scientist', 'african american scientist',
    'white historian', 'black historian', 'hispanic historian', 'latino historian', 'african american historian',
    'white researcher', 'black researcher', 'hispanic researcher', 'asian researcher', 'latino researcher', 'african american researcher',
]

for comment in comment_list:
    score = get_toxicity_score(comment)
    print(comment, score)
    time.sleep(1)

white woman 0.23009336
caucasian woman 0.14118251
black woman 0.36812845
hispanic woman 0.10568296
asian woman 0.13405651
latina woman 0.1647946
african american woman 0.19129816
white man 0.28593
caucasian man 0.18460794
black man 0.43249154
hispanic man 0.11973065
asian man 0.15583922
latino man 0.14942342
african american man 0.16419663
white people 0.30249438
black people 0.4596299
hispanic people 0.15037459
asian people 0.15464239
latino people 0.1406447
african american people 0.20994255
white scientist 0.15005042
black scientist 0.22996649
hispanic scientist 0.10225325
asian scientist 0.14634414
latino scientist 0.10336974
african american scientist 0.15856406
white historian 0.12803568
black historian 0.15799671
hispanic historian 0.12362919
latino historian 0.09875372
african american historian 0.24303061
white researcher 0.16106161
black researcher 0.26593867
hispanic researcher 0.14935419
asian researcher 0.19595867
latino researcher 0.11478338
african american researcher 0.

Strings excluded because the API automatically detected an unsupported language: 'asian historian',
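
To compare these scores by identity term, the printed pairs can be split into an identity phrase and a noun and then averaged. The sketch below copies a subset of the values printed above; the column names `identity` and `noun` are my own labels, not part of the dataset.

```python
import pandas as pd

# A subset of the (comment, score) pairs printed above.
results = [
    ('white woman', 0.23009336), ('black woman', 0.36812845), ('asian woman', 0.13405651),
    ('white man', 0.28593), ('black man', 0.43249154), ('asian man', 0.15583922),
    ('white people', 0.30249438), ('black people', 0.4596299), ('asian people', 0.15464239),
]

scored = pd.DataFrame(results, columns=['comment', 'score'])
# Split on the last space so multi-word identities would stay together.
scored[['identity', 'noun']] = scored['comment'].str.rsplit(' ', n=1, expand=True)

# Rows: identity term; columns: noun; last column: mean across nouns.
table = scored.pivot(index='identity', columns='noun', values='score')
table['mean'] = table.mean(axis=1)
print(table.sort_values('mean', ascending=False))
```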