## Are comments using profanity doomed to a high toxicity score?

**Disclaimer:** This data analysis examines profanity in a pandas DataFrame of comments.

**1**. First, I imported pandas and NumPy and loaded the CSV data set.

In [37]:
import pandas as pd
import numpy as np

df = pd.read_csv('labeled_and_scored_comments.csv')

**2**. Then I displayed the first rows of the DataFrame to verify that the file loaded correctly.

In [38]:
df.head()

Unnamed: 0,id,comment_text,score,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0001ea8717f6de06,Thank you for understanding. I think very high...,0.075638,0,0,0,0,0,0
1,000247e83dcc1211,:Dear god this site is horrible.,0.450459,0,0,0,0,0,0
2,0002f87b16116a7f,"""::: Somebody will invariably try to add Relig...",0.667964,0,0,0,0,0,0
3,0003e1cccfd5a40a,""" \n\n It says it right there that it IS a typ...",0.068434,0,0,0,0,0,0
4,00059ace3e3e9a53,""" \n\n == Before adding a new product to the l...",0.151724,0,0,0,0,0,0


**3**. I imported the Google API client and defined a function that sends a comment to the Comment Analyzer API, using my own API key, and returns its toxicity score.

In [39]:
from googleapiclient.discovery import build
import json

def get_toxicity_score(comment):
    """Return the TOXICITY summary score the Comment Analyzer API gives one comment."""
    API_KEY = 'KEY'  # placeholder; replace with your own key

    # Build the Comment Analyzer client from its discovery document.
    client = build(
        "commentanalyzer",
        "v1alpha1",
        developerKey=API_KEY,
        discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
        static_discovery=False,
    )

    # Request only the TOXICITY attribute for this comment.
    analyze_request = {
        'comment': {'text': comment},
        'requestedAttributes': {'TOXICITY': {}},
    }

    response = client.comments().analyze(body=analyze_request).execute()
    toxicity_score = response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

    return toxicity_score

**4**. I decided to see what happens if I score every comment in the data set, selecting the text from the 'comment_text' column of df.

In [40]:
comment_list = df['comment_text']

for comment in comment_list:
    score = get_toxicity_score(comment)
    print(comment, score)
    

Thank you for understanding. I think very highly of you and would not revert without discussion. 0.0756376
:Dear god this site is horrible. 0.4504588
"::: Somebody will invariably try to add Religion?  Really??  You mean, the way people have invariably kept adding ""Religion"" to the Samuel Beckett infobox?  And why do you bother bringing up the long-dead completely non-existent ""Influences"" issue?  You're just flailing, making up crap on the fly. 
 ::: For comparison, the only explicit acknowledgement in the entire Amos Oz article that he is personally Jewish is in the categories!    

 " 0.6679636
" 

 It says it right there that it IS a type. The ""Type"" of institution is needed in this case because there are three levels of SUNY schools: 
 -University Centers and Doctoral Granting Institutions 
 -State Colleges 
 -Community Colleges. 

 It is needed in this case to clarify that UB is a SUNY Center. It says it even in Binghamton University, University at Albany, State University 

HttpError: <HttpError 429 when requesting https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze?key=AIzaSyCXhIY2VpwwSc6p0KbISJQOQKJ-kNvyI30&alt=json returned "Quota exceeded for quota metric 'Analysis requests (AnalyzeComment)' and limit 'Analysis requests (AnalyzeComment) per minute' of service 'commentanalyzer.googleapis.com' for consumer 'project_number:1041034797911'.". Details: "[{'@type': 'type.googleapis.com/google.rpc.ErrorInfo', 'reason': 'RATE_LIMIT_EXCEEDED', 'domain': 'googleapis.com', 'metadata': {'quota_limit': 'AnalyzeRequestsPerMinutePerProject', 'consumer': 'projects/1041034797911', 'quota_metric': 'CommentAnalyzerService/analyze_requests', 'service': 'commentanalyzer.googleapis.com'}}]">

Note: This worked once, but the second run stopped with an HTTP 429 error: the API's per-minute request quota was exceeded. From the comments it did score, I decided to formulate a hypothesis.
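One way to stay under a per-minute quota is to pause between requests and retry after a failure. The following is a minimal sketch, not the method used in this notebook: `score_with_backoff`, the `pause` of 1.1 seconds, and the retry count are all assumptions, and the actual quota depends on the project. The scoring function is passed in as a parameter so the loop can be tried without calling the API.

```python
import time

def score_with_backoff(comments, score_fn, pause=1.1, max_retries=3, sleep=time.sleep):
    """Score comments one at a time, pausing between calls and retrying failures.

    Hypothetical helper: `pause` and `max_retries` are illustrative values.
    """
    results = []
    for comment in comments:
        for attempt in range(max_retries + 1):
            try:
                results.append((comment, score_fn(comment)))
                break
            except Exception:
                # In real use, narrow this to googleapiclient.errors.HttpError
                # and retry only when the status code is 429.
                if attempt == max_retries:
                    # Give up on this comment but keep scoring the rest.
                    results.append((comment, None))
                else:
                    # Wait longer after each failed attempt.
                    sleep(pause * 2 ** (attempt + 1))
        # Pause after every comment to spread requests over time.
        sleep(pause)
    return results
```

With the function from step 3, this could be called as `score_with_backoff(df['comment_text'], get_toxicity_score)`. Scoring the whole data set this way would still take a long time, since every comment costs at least one pause.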

**5**. I sorted the data set by 'score' to see which comments were rated highest and lowest.

In [41]:
df.sort_values(['score'])

Unnamed: 0,id,comment_text,score,toxic,severe_toxic,obscene,threat,insult,identity_hate
1158,61a85266de70cca0,Можешь говорить по русски.,0.000004,0,0,0,0,0,0
5246,030a9ba0b32b4456,Rodzina Krusenstern Potomkowie Admirała Adama ...,0.000009,0,0,0,0,0,0
11040,a6814193ac6b4705,The opening paragraph states that the game is ...,0.000091,0,0,0,0,0,0
24192,f0d6dee96e057a5b,The book is used as a source in a number of ar...,0.000116,0,0,0,0,0,0
2332,cbb9268d3c7429c1,Vals Vienés \n |Rumba \n |Cha Cha Chá(ciclo 7),0.000131,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...
19994,1429a2c930688317,you fucking piece of shit,0.997278,1,1,1,0,1,0
28728,d82e45eb103ad07b,fuck off you stupid aspy asshole,0.997736,1,1,1,0,1,0
25535,e9b92c6c6b01aeef,FUCK YOU YOU PIECE OF SHIT \n FUCK YOU YOU PIE...,0.997982,1,1,1,0,1,0
25945,e783fd267f3a9d3b,FUCK WIKIPEDIA ON WHEELS! \n\nFuck off wikiped...,0.998136,1,1,1,0,1,0


**My hypothesis, based on the scores printed above and the sorted table, is that the API will not rate a comment that uses profanity below 0.5, even when the profanity is not used in a negative context.**

Method: To test my hypothesis, I will attempt to extract the comments using profanity and describe them to see what the lowest/highest scores are. I will also formulate original comments to test how the API responds to positive comments containing profanity.

**6**. I began the test by using the describe command to summarize the comments containing the word (and its variants) that was most prevalent in the highest-rated comments.

In [75]:
df.loc[df['comment_text'].str.contains("fuck|fucking|fucker|fucked", case=False)].describe()

Unnamed: 0,score,toxic,severe_toxic,obscene,threat,insult,identity_hate
count,1163.0,1163.0,1163.0,1163.0,1163.0,1163.0,1163.0
mean,0.921169,0.885641,0.226139,0.855546,0.036973,0.658641,0.120378
std,0.105586,0.318384,0.41851,0.351701,0.188778,0.474369,0.325543
min,0.148965,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.901799,1.0,0.0,1.0,0.0,0.0,0.0
50%,0.959293,1.0,0.0,1.0,0.0,1.0,0.0
75%,0.982045,1.0,0.0,1.0,0.0,1.0,0.0
max,0.998329,1.0,1.0,1.0,1.0,1.0,1.0


*Note*: The mean score is about 0.92 and the 25th percentile is about 0.90, so most comments containing these words are rated as highly toxic. However, the minimum score is about 0.15, which is an outlier.
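The hypothesis can also be checked more directly by counting how many matching comments score below 0.5, rather than reading it off the summary statistics. This is a minimal sketch: `share_below` is a hypothetical helper, and the toy DataFrame uses made-up scores in place of the real CSV.

```python
import pandas as pd

def share_below(df, pattern, threshold=0.5):
    """Return the rows matching `pattern` and the fraction scoring below `threshold`."""
    mask = df["comment_text"].str.contains(pattern, case=False, regex=True, na=False)
    matched = df.loc[mask]
    if matched.empty:
        return matched, float("nan")
    return matched, float((matched["score"] < threshold).mean())

# Toy data standing in for labeled_and_scored_comments.csv (scores are made up).
toy = pd.DataFrame({
    "comment_text": ["what the fuck", "Fuck you (2014)", "nice article", "FUCKING great"],
    "score": [0.95, 0.15, 0.02, 0.60],
})

matched, frac = share_below(toy, "fuck")
print(len(matched), frac)
```

On the real data, `share_below(df, "fuck")` would return the subset described above along with the share of it that contradicts the hypothesis.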

**7**. Next, I searched for the outlier in the DataFrame. 

In [91]:
df[df['score'] == 0.989706]

Unnamed: 0,id,comment_text,score,toxic,severe_toxic,obscene,threat,insult,identity_hate


**8**. The query returned no rows, so I looked for the comment directly in the CSV file. I suspect the exact-equality test failed because the scores are floating-point numbers: the table shows them rounded to six decimals, so a value typed from the display rarely equals the stored one. (The value queried in step 7, 0.989706, also differs from the minimum of about 0.148965 reported in step 6.)

In the file I found that the comment was a biography of a web author and musician. The word appears in the title of a song, so it was not part of the commenter's own wording. This explains the outlier.

Fragment from comment: 

"Destiny Lativia Flowers(born October 19,1999), known by her stage name Cash Lady...

Music with PHG

Fuck you (2014) 

I love you (Remix of End of Time by Justin Timberlake)(2014)

Trust and Believe (Cover)(2013)

Cash Lady"
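The lookup from step 7 can also be done in code without matching a float exactly. The sketch below uses a toy DataFrame with made-up scores. It contrasts exact equality with a tolerance comparison via `np.isclose`, and with `idxmin`, which asks for the row holding the minimum and so needs no typed number at all.

```python
import numpy as np
import pandas as pd

# Toy scores (made up) with more digits than a six-decimal display would show.
toy = pd.DataFrame({
    "comment_text": ["a", "b", "c"],
    "score": [0.9212345, 0.1489647, 0.9982911],
})

# Exact equality with the rounded, displayed value finds nothing.
exact = toy[toy["score"] == 0.148965]

# Comparing within a small tolerance matches the rounded value.
close = toy[np.isclose(toy["score"], 0.148965, atol=1e-6)]

# Selecting the row with the minimum score avoids typing a number.
lowest = toy.loc[toy["score"].idxmin()]

print(len(exact), close["comment_text"].tolist(), lowest["comment_text"])
```

On the real data, the same `idxmin` call would be applied to the filtered subset from step 6 rather than to the whole DataFrame.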


**9**. To check whether other profanity follows the same pattern as the first case, I ran the same analysis on more words.

In [96]:
df.loc[df['comment_text'].str.contains("damn|cunt|bastard|bitch|dick|shit", case=False)].describe()

Unnamed: 0,score,toxic,severe_toxic,obscene,threat,insult,identity_hate
count,1372.0,1372.0,1372.0,1372.0,1372.0,1372.0,1372.0
mean,0.826966,0.709184,0.134111,0.612245,0.02551,0.494169,0.091108
std,0.195249,0.454305,0.340896,0.487416,0.157726,0.500148,0.287867
min,0.060533,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.753805,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.902871,1.0,0.0,1.0,0.0,0.0,0.0
75%,0.966562,1.0,0.0,1.0,0.0,1.0,0.0
max,0.998136,1.0,1.0,1.0,1.0,1.0,1.0


The result for the other profanity is similar: the mean is high (about 0.83), and the minimum score (about 0.06) is an outlier.

When I investigated the outlier in the CSV file, as in the previous example, I found that the match was a misidentification of the word: none of the comments that actually contain the keywords has the minimum score (0.060533).
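One plausible cause of such a misidentification, though the notebook does not confirm it, is that `str.contains` matches substrings, so a keyword can match inside a longer, unrelated word. A minimal sketch with made-up comments shows how the regex word boundary `\b` narrows the match:

```python
import pandas as pd

# Made-up comments chosen to show a substring false positive.
toy = pd.DataFrame({
    "comment_text": [
        "Charles Dickens wrote it",
        "that is a damn shame",
        "Moby Dick is a novel",
    ],
})

loose = r"damn|dick"             # matches anywhere, including inside "Dickens"
strict = r"\b(?:damn|dick)\b"    # matches whole words only

loose_hits = toy["comment_text"].str.contains(loose, case=False)
strict_hits = toy["comment_text"].str.contains(strict, case=False)

print(loose_hits.tolist(), strict_hits.tolist())
```

Word boundaries remove the "Dickens" match but not "Moby Dick", where the keyword is a whole word used as a name. Cases like that, or the song title in step 8, still need a manual check.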

**10**. Finally, I analyzed original comments to test whether profanity used in a positive context would reflect a lower score (compared to the mean).

In [102]:
comment_list = ['I fucking love you!', 'You are the shit!', 'I love you bitch!']

for comment in comment_list:
    score = get_toxicity_score(comment)
    print(comment, score)
  

I fucking love you! 0.5876557
You are the shit! 0.97820383
I love you bitch! 0.92969686


Results:
The scores for the last two comments match what the data leads us to expect: both are above the mean of about 0.83 for comments containing those words, so we observe no change when those two words are used positively. However, the first example scores about 0.59, significantly lower than the mean of about 0.92 for comments using the same word. This may indicate that the phrase is common enough for the algorithm to recognize a positive intent. The score is still above 0.5, which suggests that profanity carries significant weight in determining scores.