### Checking our list of swearwords

The musiXmatch dataset contains lyrics for a selection of tracks from the Million Song Dataset.

However, the lyrics are only given as word counts. For any given song, we don't have the actual text, only the number of times each word occurs, for example how often the singer says "I" or "you".
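
To make the word-count format concrete, here is a minimal sketch using an in-memory SQLite database. The `lyrics` columns used here (`track_id`, `word`, `count`), the track ID, and the counts are assumptions for illustration; the real table may contain further columns.

```python
import sqlite3

# In-memory stand-in for mxm_dataset.db; schema and rows are illustrative assumptions.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE lyrics (track_id TEXT, word TEXT, count INTEGER)")
demo.executemany(
    "INSERT INTO lyrics VALUES (?, ?, ?)",
    [("TRACK_A", "i", 6), ("TRACK_A", "you", 4), ("TRACK_A", "god", 2)],
)

# All we can recover for a track is a word -> count mapping, not the word order.
rows = demo.execute(
    "SELECT word, count FROM lyrics WHERE track_id = ? ORDER BY count DESC",
    ("TRACK_A",),
).fetchall()
bag_of_words = dict(rows)
print(bag_of_words)
```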

This makes rating offensiveness much harder, since we lose all context.
For example: "Jesus Christ" and "god" are rated as mildly offensive and non-discriminatory. If they are used as expletives, that might be correct. But we still don't want to classify gospel songs as outliers with a high frequency of mild swearing.
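
The loss of context can be illustrated with a small sketch. The two example lines and the `word_counts` helper are made up for this illustration; they are not taken from the dataset.

```python
from collections import Counter

def word_counts(text):
    """Lowercase, split on whitespace, and count tokens (a crude bag-of-words)."""
    return Counter(text.lower().split())

# Two made-up lines: one devotional, one using the same word as an expletive.
gospel = word_counts("praise god for god is good")
expletive = word_counts("oh god no god why")

# Both lines yield the same count for "god"; the usage is no longer recoverable.
print(gospel["god"], expletive["god"])
```

Once only the counts are stored, nothing distinguishes the two lines as far as the word "god" is concerned.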

In this notebook we load all words that are present in the lyrics and check which of them are covered by our list of swearwords.
Then we go through the lyrics vocabulary again, to check whether our list of swearwords is missing something.
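
The two checks can be sketched with plain set operations. The three sets below are hypothetical miniature inputs; in the notebook the real ones come from the database, the pickled table, and a manual review.

```python
# Hypothetical miniature inputs, for illustration only.
vocabulary = {"i", "the", "you", "god", "damn", "fool", "crap"}
swearwords = {"arse", "crap", "damn", "god"}
manual_candidates = {"damn", "fool", "crap"}

# Check 1: swearwords that actually occur in the lyrics vocabulary.
covered = vocabulary & swearwords

# Check 2: manually flagged vocabulary words that the swearword list lacks.
missing = (manual_candidates & vocabulary) - swearwords

print(sorted(covered))
print(sorted(missing))
```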

In [1]:
import sqlite3

In [2]:
conn = sqlite3.connect("datasets/mxm_dataset.db")

In [3]:
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(cursor.fetchall())

[('words',), ('lyrics',)]


In [4]:
cursor = conn.cursor()
cursor.execute("SELECT * FROM words;")
words = cursor.fetchall()
words[:5]

[('i',), ('the',), ('you',), ('to',), ('and',)]

In [5]:
import pandas as pd

In [6]:
word_table = pd.read_pickle("pickles/word_table.pickle")
word_table.head()

word,category,strength,target
arse,non-discriminatory,mild,
bloody,non-discriminatory,mild,
bugger,non-discriminatory,mild,
cow,non-discriminatory,mild,
crap,non-discriminatory,mild,


In [7]:
for word in words:
    if word[0] in word_table.index:
        print(word)

('god',)
('fuck',)
('shit',)
('damn',)
('ho',)
('bitch',)
('special',)
('dick',)
('whore',)
('mental',)
('bastard',)
('negro',)
('bullshit',)
('cock',)
('cow',)
('slut',)
('psycho',)
('cunt',)
('crap',)


In [8]:
# I read through all the words in the database and compiled these lists manually.
# Please note that I'm not a native speaker, so I may have missed something.

# First list: words that are very likely used as swearwords,
# or that are obviously obscene.
offensive_list = [
    "fuck", "fool", "shit", "nigga", "damn", "bitch", "ass", "fuckin",
    "freak", "motherfuck", "rape", "dick", "whore", "bastard", "sucker",
    "pussi", "bum", "gay", "cock", "jerk", "cunt", "junk", "motherfuckin",
    "crap",
]

# Second list: colloquial words.
# They might be offensive, but that is not clear without context.
colloquial_list = ["trippin", "thug", "gangsta", "gypsi", "booti", "junki", "shorti"]

# Third list: very negative words that are not offensive in themselves.
# It only contains a few examples.
# (A stray leading space in "satan" has been removed so that the entry can match.)
negative_list = ["slave", "satan", "suicid", "cocain", "genocid"]
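
Hand-typed lists are easy to get subtly wrong, for example through a stray space or a spelling that never occurs in the stemmed vocabulary. The following is a minimal sketch of a sanity check; `check_candidates` is a hypothetical helper, and the vocabulary and candidates are illustrative only.

```python
def check_candidates(candidates, vocabulary):
    """Split candidates into clean matches and entries that need attention."""
    ok, suspicious = [], []
    for word in candidates:
        # Flag entries with surrounding whitespace or without a vocabulary match.
        if word != word.strip() or word not in vocabulary:
            suspicious.append(word)
        else:
            ok.append(word)
    return ok, suspicious

# Hypothetical vocabulary and candidate list, for illustration only.
vocabulary = {"slave", "satan", "suicid"}
ok, suspicious = check_candidates(["slave", " satan", "suicide"], vocabulary)
print(ok)
print(suspicious)
```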

## Cleaning the swearwords

Based on the data above, we have to clean up our list of swearwords.
We remove
 - negro, because in these lyrics it is probably just the Spanish word for "black"
 - god, because this is only offensive in context

The words found above that are not yet in the list of swearwords are added, with ratings based on their similarity to existing words.
For most additions, the choice of strength and category seems easy. The following choices are debatable:

- "freak", similar to "retard" or "spastic"
- "jerk", similar to "bastard"
- "booti", similar to "arse"

We also add some words from the colloquial list:
 - gypsi
 - shorti
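
Where a new word gets exactly the same rating as an existing one, the row can be copied instead of retyped. This is only a sketch: `add_like` is a hypothetical helper that the notebook does not use, and `demo_table` is a tiny stand-in for `word_table` with the same columns.

```python
import pandas as pd

def add_like(table, new_word, existing_word):
    """Return a copy of `table` where `new_word` gets the rating of `existing_word`."""
    result = table.copy()
    # Assigning a list to a new label appends a row (setting with enlargement).
    result.loc[new_word] = result.loc[existing_word].tolist()
    return result

# Tiny stand-in for word_table; rows and values are illustrative.
demo_table = pd.DataFrame(
    {
        "category": ["non-discriminatory", "non-discriminatory"],
        "strength": ["mild", "strongest"],
        "target": [None, None],
    },
    index=pd.Index(["arse", "fuck"], name="word"),
)

demo_table = add_like(demo_table, "ass", "arse")
demo_table = add_like(demo_table, "fuckin", "fuck")
print(demo_table.loc[["ass", "fuckin"], ["category", "strength"]])
```

Words whose rating differs from the reference word, such as a different strength, still need an explicit assignment.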

In [9]:
word_table = word_table.drop("negro")
word_table = word_table.drop("god")


In [10]:
# fuck exists

# we add fool, similar to "loony", "mental" 
word_table.loc["fool"]={"category":"discriminative", "strength":"mild", "target":"mental or physical ability"}

# shit exists

# we add nigga, similar to nigger
word_table.loc["nigga"]={"category":"discriminative", "strength":"strongest", "target":"race"}

# damn exists
# bitch exists

# we add ass, similar to arse
word_table.loc["ass"]={"category":"non-discriminatory", "strength":"mild", "target":None}

# we add fuckin, similar to fuck
word_table.loc["fuckin"]={"category":"non-discriminatory", "strength":"strongest", "target":None}

# we add freak, similar to retard
word_table.loc["freak"]={"category":"discriminative", "strength":"strongest", "target":"mental or physical ability"}

# we add motherfuck and motherfuckin, similar to motherfucker
word_table.loc["motherfuck"]={"category":"non-discriminatory", "strength":"strongest", "target":None}
word_table.loc["motherfuckin"]={"category":"non-discriminatory", "strength":"strongest", "target":None}

# we add rape, similar to rapey
word_table.loc["rape"]={"category":"non-discriminatory", "strength":"strongest", "target":None}

# dick already exists
# whore already exists


# we add bum, similar to arse
word_table.loc["bum"]={"category":"non-discriminatory", "strength":"mild", "target":None}

# we add gay, similar to homo
word_table.loc["gay"]={"category":"discriminative", "strength":"strong", "target":"sexuality"}

# cock already exists in the table; this overwrites its entry (rated similar to dick)
word_table.loc["cock"]={"category":"non-discriminatory", "strength":"strong", "target":None}

# we add jerk, similar to bastard
word_table.loc["jerk"]={"category":"non-discriminatory", "strength":"strong", "target":None}

# we add junk, similar to crap
word_table.loc["junk"]={"category":"non-discriminatory", "strength":"mild", "target":None}

# we add booti (the stemmed form of booty), similar to arse
word_table.loc["booti"]={"category":"non-discriminatory", "strength":"mild", "target":None}

In [11]:
# we add shorti, mildly discriminative against women
word_table.loc["shorti"]={"category":"discriminative", "strength":"mild", "target":"sexuality"}

# we add gypsi, medium discriminative against gypsies
word_table.loc["gypsi"]={"category":"discriminative", "strength":"medium", "target":"race"}

## Checking occurrence of words again
We go through all words that occur in the lyrics again and check them against our updated list of swearwords.
The resulting list should only contain words that are obviously swearwords.

Finally, we save the resulting dataframe.

In [12]:
for word in words:
    if word[0] in word_table.index:
        print(word)

('fuck',)
('fool',)
('shit',)
('nigga',)
('damn',)
('ho',)
('bitch',)
('ass',)
('special',)
('fuckin',)
('freak',)
('motherfuck',)
('rape',)
('dick',)
('whore',)
('mental',)
('bastard',)
('bum',)
('bullshit',)
('gay',)
('cock',)
('gypsi',)
('cow',)
('jerk',)
('booti',)
('shorti',)
('slut',)
('psycho',)
('cunt',)
('junk',)
('motherfuckin',)
('crap',)
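
The printed tuples show which words matched, but not how they are rated. For a manual review it can help to display the matching rows of the table instead. This is a minimal sketch with stand-in data; `demo_words` and `demo_table` are hypothetical substitutes for `words` and `word_table`.

```python
import pandas as pd

# Stand-ins for `words` (1-tuples from fetchall) and `word_table`; values are illustrative.
demo_words = [("i",), ("cow",), ("you",), ("crap",)]
demo_table = pd.DataFrame(
    {
        "category": ["non-discriminatory"] * 3,
        "strength": ["mild"] * 3,
    },
    index=pd.Index(["arse", "cow", "crap"], name="word"),
)

# Unpack the 1-tuples, then keep only table rows whose word occurs in the lyrics.
vocabulary = [word for (word,) in demo_words]
matched = demo_table[demo_table.index.isin(vocabulary)]
print(matched)
```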


In [13]:
word_table.to_pickle("pickles/word_table_cleaned.pickle")