## Setting up a DataFrame for the offensiveness ratings
I'm using a paper released by the British telecommunications regulator Ofcom: https://www.ofcom.org.uk/__data/assets/pdf_file/0023/91625/OfcomQRG-AOC.pdf

There is a useful transcription here:
http://metro.co.uk/2016/10/02/swearing-ranked-from-mild-to-strongest-6165629/#

I created a JSON file from the data on that page.
This notebook loads the file and converts it into a pandas DataFrame with a usable layout.

In [1]:
import json

In [2]:
with open("rating_offensiveness/data/metro_co_uk_transcription.json") as file:
    offensiveness_rating = json.load(file)
    
# The offensiveness rating is a nested JSON object.
# Three levels of objects (category -> target -> strength) lead to a list of words.
# For example:
print(offensiveness_rating["discriminatory"]["religion"]["strong"])
print(offensiveness_rating["discriminatory"]["race"]["medium"])

# Some offensive words are not discriminatory.
# For those the second key has no real meaning, so I entered "offensive" as a placeholder.
print(offensiveness_rating["non-discriminatory"]["offensive"]["mild"])

['Fenian', 'Kafir', 'Kufaar', 'Kike', 'Papist', 'Prod', 'Taig', 'Yid']
['Kraut', 'Pikey', 'Taff']
['Arse', 'Bloody', 'Bugger', 'Cow', 'Crap', 'Damn', 'Ginger', 'Git', 'God', 'Goddam', 'Jesus Christ', 'Minger', 'Sod-off']
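
Since everything downstream relies on the words sitting exactly three levels deep, a quick shape check can catch a malformed file early. The following is only a sketch: `sample_rating` is a hypothetical miniature of the file (the discriminatory entries are placeholders), and `check_layout` is an illustrative helper, not part of the original notebook.

```python
# Hypothetical miniature of the nested layout: category -> target -> strength -> words.
# The discriminatory entries are placeholders, not real data.
sample_rating = {
    "discriminatory": {
        "religion": {"strong": ["placeholder-a"]},
        "race": {"medium": ["placeholder-b", "placeholder-c"]},
    },
    "non-discriminatory": {
        "offensive": {"mild": ["Arse", "Bloody"]},
    },
}


def check_layout(rating):
    """Return the total number of words; raise ValueError on an unexpected shape."""
    count = 0
    for category, targets in rating.items():
        if not isinstance(targets, dict):
            raise ValueError(f"{category!r} should map to an object of targets")
        for target, strengths in targets.items():
            if not isinstance(strengths, dict):
                raise ValueError(f"{category!r}/{target!r} should map to an object of strengths")
            for strength, words in strengths.items():
                # the innermost level must be a list of strings
                if not isinstance(words, list) or not all(isinstance(w, str) for w in words):
                    raise ValueError(f"{category!r}/{target!r}/{strength!r} should be a list of words")
                count += len(words)
    return count


word_count = check_layout(sample_rating)
print(word_count)
```

Running `check_layout(offensiveness_rating)` on the loaded file would apply the same check to the real data.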


In [3]:
# For further processing we want a DataFrame with the words as index.
# Each row should hold a word and its offensiveness rating;
# if the word is discriminatory, we also want the target group.

entries = []

def clean(word):
    return word.strip().lower()

# words are always three levels down in the file
for category, targets in offensiveness_rating.items():
    # validate the top-level key before using it
    assert category in ["discriminatory", "non-discriminatory"]

    for target, strengths in targets.items():
        for strength, words in strengths.items():
            for word in words:
                entries.append({
                    "category": clean(category),
                    "word": clean(word),
                    "strength": clean(strength),
                    # the target is only meaningful for discriminatory words
                    "target": clean(target) if category == "discriminatory" else None,
                })

In [4]:
entries[:2]

[{'category': 'non-discriminatory',
  'strength': 'mild',
  'target': None,
  'word': 'arse'},
 {'category': 'non-discriminatory',
  'strength': 'mild',
  'target': None,
  'word': 'bloody'}]
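
Before building the DataFrame, it can help to sanity-check the flat list. The sketch below counts entries per category and strength, and looks for discriminatory entries that lost their target. `sample_entries` is a hypothetical stand-in for `entries` (the third item is a placeholder), so the counts say nothing about the real data.

```python
from collections import Counter

# Hypothetical stand-in for `entries`; the last item is a placeholder word.
sample_entries = [
    {"category": "non-discriminatory", "strength": "mild", "target": None, "word": "arse"},
    {"category": "non-discriminatory", "strength": "mild", "target": None, "word": "bloody"},
    {"category": "discriminatory", "strength": "medium", "target": "race", "word": "placeholder"},
]

# how many words fall into each (category, strength) bucket
counts = Counter((e["category"], e["strength"]) for e in sample_entries)

# discriminatory words should always carry a target group
missing_target = [
    e["word"]
    for e in sample_entries
    if e["category"] == "discriminatory" and e["target"] is None
]

print(counts)
print(missing_target)
```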

In [5]:
import pandas as pd

In [6]:
word_table = pd.DataFrame(entries)
word_table.head(5)

             category  strength  target    word
0  non-discriminatory      mild    None    arse
1  non-discriminatory      mild    None  bloody
2  non-discriminatory      mild    None  bugger
3  non-discriminatory      mild    None     cow
4  non-discriminatory      mild    None    crap


In [7]:
word_table = word_table.set_index("word")
word_table.head(5)

                  category  strength  target
word
arse    non-discriminatory      mild    None
bloody  non-discriminatory      mild    None
bugger  non-discriminatory      mild    None
cow     non-discriminatory      mild    None
crap    non-discriminatory      mild    None
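
With the words as index, a rating can be looked up directly via `.loc`. The sketch below assumes the index is unique and that lookups are normalised the same way as in `clean`. Both `sample_table` and the `lookup_strength` helper are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical miniature of the indexed table; the last word is a placeholder.
sample_table = pd.DataFrame(
    [
        {"category": "non-discriminatory", "strength": "mild", "target": None, "word": "arse"},
        {"category": "non-discriminatory", "strength": "mild", "target": None, "word": "bloody"},
        {"category": "discriminatory", "strength": "medium", "target": "race", "word": "placeholder"},
    ]
).set_index("word")

# .loc returns a single row only if every word occurs once in the index
assert sample_table.index.is_unique


def lookup_strength(table, word):
    """Return the strength for a word, or None if the word is not in the table."""
    key = word.strip().lower()  # same normalisation as clean()
    if key not in table.index:
        return None
    return table.loc[key, "strength"]


print(lookup_strength(sample_table, " Arse "))
print(lookup_strength(sample_table, "unknown"))
```

If the real data contained the same word under two categories, the uniqueness assertion would fail and the lookup would need to handle several rows per word.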


In [8]:
word_table.to_pickle("pickles/word_table.pickle")
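
`to_pickle` does not create missing directories, so `pickles/` has to exist before the cell above runs. The following sketch shows a round trip through a temporary directory. The sample table and the temporary paths are assumptions for illustration, not the notebook's real data or layout.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical miniature of word_table.
sample_table = pd.DataFrame(
    [
        {"category": "non-discriminatory", "strength": "mild", "target": None, "word": "arse"},
        {"category": "non-discriminatory", "strength": "mild", "target": None, "word": "bloody"},
    ]
).set_index("word")

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_dir = Path(tmp_dir) / "pickles"
    # create the target directory first; to_pickle will not do it for us
    pickle_dir.mkdir(parents=True, exist_ok=True)

    pickle_path = pickle_dir / "word_table.pickle"
    sample_table.to_pickle(pickle_path)

    # load the table back, as a later notebook would
    restored = pd.read_pickle(pickle_path)

round_trip_ok = restored.equals(sample_table)
print(round_trip_ok)
```

Keep in mind that pickles should only be loaded from trusted sources, since unpickling can execute arbitrary code.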