
This notebook identifies profanity in tweets.

1. A file of tweets is imported.
2. The tweets are cleaned to remove noise such as re-tweet tags, mentions, hashtags, links, and punctuation.
3. A list of profane words, including racial slurs, is defined (a sample list is used here).
4. Each tweet is tokenized and its tokens are checked against the list.
5. Each tweet is assigned a profanity score based on the number of listed words it contains.

# Import packages

In [None]:
import pandas as pd
import re

# Calculate the degree of profanity for each tweet in a file

## Import and clean tweets

### Import tweets from file and store in pandas dataframe

In [None]:
tweets_df = pd.read_csv('/content/drive/MyDrive/Datasets/twitter_file.csv', names=['tweets'])

### Clean tweets and store in pandas dataframe

In [None]:
# Function to clean tweets
def clean_tweet(tweet):

  # Non-string values (such as NaN for missing tweets) become empty strings
  if not isinstance(tweet, str):
    return ""

  # Remove re-tweet tags (RT) as whole words only, so "RT" inside other words is kept
  temp_tweet = re.sub(r"\bRT\b", " ", tweet)

  # Convert tweets to lower case
  temp_tweet = temp_tweet.lower()

  # Remove mentions and hashtags from tweets
  temp_tweet = re.sub("@[A-Za-z0-9_]+", " ", temp_tweet)
  temp_tweet = re.sub("#[A-Za-z0-9_]+", " ", temp_tweet)

  # Remove links from tweets
  temp_tweet = re.sub(r"http\S+", " ", temp_tweet)
  temp_tweet = re.sub(r"www\.\S+", " ", temp_tweet)

  # Remove punctuation and bracketed text
  temp_tweet = re.sub(r'[()!?]', ' ', temp_tweet)
  temp_tweet = re.sub(r'\[.*?\]', ' ', temp_tweet)

  # Remove non-alphanumeric characters
  temp_tweet = re.sub("[^a-z0-9]"," ", temp_tweet)

  return temp_tweet
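
The re-tweet removal step is sensitive to how the pattern is written. The following is a minimal sketch, using a made-up sample string, that compares an unanchored pattern with a word-boundary pattern. It also shows that a lower-case "rt" is left alone in both cases, because the match is case-sensitive and runs before lower-casing.

```python
import re

# Made-up sample string for illustration only
sample = "RT @user: ART START rt"

# Unanchored pattern: also removes "RT" inside longer words
unanchored = re.sub("RT", " ", sample)

# Word-boundary pattern: removes only the standalone token "RT"
anchored = re.sub(r"\bRT\b", " ", sample)

print(unanchored.split())  # ['@user:', 'A', 'STA', 'rt']
print(anchored.split())    # ['@user:', 'ART', 'START', 'rt']
```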

In [None]:
tweets_df['cleaned_tweets'] = tweets_df['tweets'].apply(clean_tweet)
tweets_df



## Tokenize tweets

In [None]:
# Function to tokenize tweets
def tokenize_tweets(tweet):
  return tweet.split()

In [None]:
tweets_df['tweet_tokens'] = tweets_df['cleaned_tweets'].apply(tokenize_tweets)
tweets_df

Unnamed: 0,tweets,cleaned_tweets,tweet_tokens
0,!!! RT @mayasolovely: As a woman you shouldn't...,as a woman you shouldn t complain abo...,"[as, a, woman, you, shouldn, t, complain, abou..."
1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...,boy dats cold tyga dwn bad for cu...,"[boy, dats, cold, tyga, dwn, bad, for, cuffin,..."
2,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...,dawg you ever fuck a bitc...,"[dawg, you, ever, fuck, a, bitch, and, she, st..."
3,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...,she look like a tranny,"[she, look, like, a, tranny]"
4,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...,the shit you hear about me ...,"[the, shit, you, hear, about, me, might, be, t..."
...,...,...,...
26217,you's a muthaf***in lie &#8220;@LifeAsKing: @2...,you s a muthaf in lie right his t...,"[you, s, a, muthaf, in, lie, right, his, tl, i..."
26218,you've gone and broke the wrong heart baby,you ve gone and broke the wrong heart baby,"[you, ve, gone, and, broke, the, wrong, heart,..."
26219,young buck wanna eat!!.. dat nigguh like I ain...,young buck wanna eat dat nigguh like i ain...,"[young, buck, wanna, eat, dat, nigguh, like, i..."
26220,youu got wild bitches tellin you lies,youu got wild bitches tellin you lies,"[youu, got, wild, bitches, tellin, you, lies]"
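
Calling `split()` with no arguments splits on any run of whitespace and drops empty strings, so the extra spaces left behind by cleaning do not produce empty tokens. A minimal sketch, using a made-up cleaned string:

```python
# Made-up cleaned tweet with leading, trailing, and repeated spaces
cleaned = "   boy dats cold   tyga dwn  "

# split() with no separator collapses runs of whitespace
tokens = cleaned.split()
print(tokens)  # ['boy', 'dats', 'cold', 'tyga', 'dwn']

# split(" ") would instead keep empty strings between consecutive spaces
naive_tokens = cleaned.split(" ")
print("" in naive_tokens)  # True
```

This matters for the profanity score, because empty tokens would inflate the total word count.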


## Profanity Check

### Define Profane Words

In [170]:
# Sample list of profane words for testing the code
sample_profanity_word_list = [
    'shit',
    'fuck',
    'idiot',
    'asshole',
    'fuckall',
    'pussy',
    'fucking',
    'bootlicker',
    'slut',
    'terrorist',
    'prostitute',
    'whore',
    'slave',
    'sucks',
    'whitetrash',
    'nigga',
    'moron',
    'negro',
    'bitch',
    'sissy',
    'dawg',
    'hoes']

### Check for profanity for each tweet

The degree of profanity is measured with a simple score that always lies between 0 and 1.

Given a tweet and a list of profane words, the profanity score is calculated as shown below. A word counts as profane only if the token exactly matches an entry in the list.

Profanity Score = Number of profane words in the tweet / Total number of words in the tweet
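
Written as a formula, with a worked value for a 13-word tweet that contains 2 listed words (the counts are chosen for illustration):

```latex
\text{Profanity Score}
  = \frac{\text{number of profane words in the tweet}}{\text{total number of words in the tweet}},
\qquad
\text{e.g. } \frac{2}{13} \approx 0.15
```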

In [172]:
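# Function to calculate the degree of profanity (profanity_score) for a tweet
def get_profanity_score(tweet_tokens, profane_word_list):
  num_tokens = len(tweet_tokens)
  num_profane_words = 0
  profanity_score = 0
  # Count tokens that exactly match an entry in the profane word list
  for t in tweet_tokens:
    if t in profane_word_list:
      num_profane_words += 1
  # Dividing only when a match exists also avoids division by zero for empty tweets
  if num_profane_words > 0:
    profanity_score = (num_profane_words/num_tokens)

  # Round the score to two decimal places
  return round(profanity_score, 2)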
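
The scoring logic can be tried on small inputs. The sketch below is a hypothetical variant, not the notebook's own function: the name `profanity_score_sketch` and the conversion of the word list to a `set` (for faster membership checks) are assumptions made for illustration. It also shows that matching is exact, so an inflected form that is not in the list is not counted.

```python
def profanity_score_sketch(tokens, profane_words):
    """Hypothetical variant: fraction of tokens that exactly match a listed word."""
    if not tokens:
        return 0.0  # an empty tweet has no profane words
    profane_set = set(profane_words)  # assumption: set lookup instead of list scan
    hits = sum(1 for token in tokens if token in profane_set)
    return round(hits / len(tokens), 2)

words = ["idiot", "moron"]
print(profanity_score_sketch(["you", "idiot"], words))   # 0.5
print(profanity_score_sketch(["you", "idiots"], words))  # 0.0, "idiots" is not in the list
print(profanity_score_sketch([], words))                 # 0.0
```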



In [173]:
tweets_df['profanity_score'] = tweets_df['tweet_tokens'].apply(get_profanity_score, profane_word_list=sample_profanity_word_list)
tweets_df


Unnamed: 0,tweets,cleaned_tweets,tweet_tokens,profanity_score
0,!!! RT @mayasolovely: As a woman you shouldn't...,as a woman you shouldn t complain abo...,"[as, a, woman, you, shouldn, t, complain, abou...",0.00
1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...,boy dats cold tyga dwn bad for cu...,"[boy, dats, cold, tyga, dwn, bad, for, cuffin,...",0.00
2,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...,dawg you ever fuck a bitc...,"[dawg, you, ever, fuck, a, bitch, and, she, st...",0.25
3,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...,she look like a tranny,"[she, look, like, a, tranny]",0.00
4,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...,the shit you hear about me ...,"[the, shit, you, hear, about, me, might, be, t...",0.09
...,...,...,...,...
26217,you's a muthaf***in lie &#8220;@LifeAsKing: @2...,you s a muthaf in lie right his t...,"[you, s, a, muthaf, in, lie, right, his, tl, i...",0.00
26218,you've gone and broke the wrong heart baby,you ve gone and broke the wrong heart baby,"[you, ve, gone, and, broke, the, wrong, heart,...",0.00
26219,young buck wanna eat!!.. dat nigguh like I ain...,young buck wanna eat dat nigguh like i ain...,"[young, buck, wanna, eat, dat, nigguh, like, i...",0.00
26220,youu got wild bitches tellin you lies,youu got wild bitches tellin you lies,"[youu, got, wild, bitches, tellin, you, lies]",0.00


In [175]:
# Save the scored tweets to a new CSV file
tweets_df.to_csv('/content/drive/MyDrive/Datasets/curated_twitter_file.csv')

# Calculate the degree of profanity for a single input tweet

In [174]:
tweet_tok = ['i', 'met', 'that', 'pussy', 'on', 'ocean', 'dr', 'i', 'gave', 'that', 'pussy', 'a', 'pill']
score = get_profanity_score(tweet_tok, sample_profanity_word_list)
score

0.15
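
The example above starts from a list of tokens. To score a raw tweet string, the cleaning, tokenizing, and scoring steps can be chained. The following is a minimal sketch under stated assumptions: `score_raw_tweet` is a hypothetical helper that condenses the notebook's cleaning rules into one function, and the sample tweet and word list are made up.

```python
import re

def score_raw_tweet(raw_tweet, profane_words):
    """Hypothetical helper: clean, tokenize, and score one raw tweet string."""
    text = re.sub(r"\bRT\b", " ", raw_tweet)       # re-tweet tag
    text = text.lower()
    text = re.sub(r"[@#][a-z0-9_]+", " ", text)    # mentions and hashtags
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # links
    text = re.sub(r"\[.*?\]", " ", text)           # bracketed text
    text = re.sub(r"[^a-z0-9]", " ", text)         # remaining non-alphanumerics
    tokens = text.split()
    if not tokens:
        return 0.0
    profane = set(profane_words)
    hits = sum(1 for token in tokens if token in profane)
    return round(hits / len(tokens), 2)

sample_words = ["idiot", "moron"]
raw = "RT @someone: what an IDIOT move!! #fail http://t.co/abc"
print(score_raw_tweet(raw, sample_words))  # 1 match out of 4 tokens -> 0.25
```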