# Keyword lookup

Since the labels returned by Google Cloud's Natural Language API don't seem very useful or meaningful, we will try a simpler approach: counting specific keywords that may indicate sexualized content.

#### Importing packages

In [1]:
import pandas as pd
import glob
import nltk
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob

In [2]:
nltk.download("stopwords")
nltk.download("punkt")

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/rodrigo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/rodrigo/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
pd.options.display.max_rows = 500

#### Reading data in

In [4]:
df = pd.read_csv("../output/dataset/complete-data.csv")

In [5]:
# What share of the requests failed to retrieve title data?
df.full_title.isna().value_counts(normalize=True)

False    0.961432
True     0.038568
Name: full_title, dtype: float64

In [6]:
# Fill the nans with an empty string
df["full_title"] = df.full_title.fillna("")

#### Formatting

In [7]:
# Adds labels
df["racy_bool"] = df.racy.apply(lambda x: x in ["LIKELY", "VERY_LIKELY"])

#### What are we looking for?

I want to find specific keywords that may indicate sexualized content. 

This list started as a short set of words that I devised from anecdotal Google searches, together with words that are more common in entries with racy images than in others.

It was later expanded iteratively, as I noticed that more words could be useful for detecting objectifying websites.

Notice that we are looking both at the titles shown on Google images and at the domains in which the images are hosted.

First, let's find which words are more likely to appear in titles of images marked as likely or unlikely to be racy, when compared to the other images. To do so, we will use something similar to tf-idf: a comparison of each word's relative frequency in the two groups.
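
As a rough illustration of this comparison, here is a minimal sketch on made-up titles. The titles and the helper name `usage_ratios` are hypothetical and only serve the example; the functions defined in this notebook do the same kind of calculation with pandas and scikit-learn.

```python
from collections import Counter

# Toy titles, invented for illustration only
racy_titles = ["hot single women", "hot brides online"]
other_titles = ["women in science", "single page portrait of women"]

def usage_ratios(titles):
    """Share of each word among all words in a group of titles."""
    counts = Counter(word for title in titles for word in title.split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

racy = usage_ratios(racy_titles)
other = usage_ratios(other_titles)

# Only words seen in both groups can be compared
shared = sorted(set(racy) & set(other))
times_more_likely_racy = {word: racy[word] / other[word] for word in shared}
print(times_more_likely_racy)
```

In this toy data, "hot" only appears in the racy group, so it has no ratio at all. A value above 1 means the word takes up a larger share of the racy titles than of the other titles.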

In [8]:
def textblob_tokenizer(str_input, min_chars=2):
    '''
    A TextBlob tokenizer that splits a string
    into  words. Returns a list of words.
    
    >> Params
    
    str_input -> The text block to be tokenized
    min_chars -> Words must be longer than this many characters to be included. Default: 2
    '''
   
    blob = TextBlob(str_input.lower())
    
    tokens = blob.words
   
    # We will ignore numbers and short words
    words = [token for token in tokens if len(token) > min_chars and not token.isnumeric()] 
    
    return words
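
To see what the filter keeps, the sketch below applies the same length and number rules to a sample title. It is an assumption-laden stand-in: `simple_tokenizer` is a hypothetical helper that splits with a regular expression instead of TextBlob, so its handling of punctuation and contractions may differ from `textblob_tokenizer`.

```python
import re

def simple_tokenizer(str_input, min_chars=2):
    """Hypothetical stand-in for textblob_tokenizer, using a regex instead of TextBlob."""
    tokens = re.findall(r"[a-z0-9']+", str_input.lower())
    # Same filter as above: strictly more than min_chars characters, no pure numbers
    return [token for token in tokens if len(token) > min_chars and not token.isnumeric()]

print(simple_tokenizer("Top 10 Hot Women of LA in 2019"))
# ['top', 'hot', 'women']
```

Two-letter words such as "of" and "la" are dropped because the comparison is strict (`>`), and "2019" is dropped because it is numeric.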

In [9]:
def make_matrix(df, column):
    '''
    Creates a numeric matrix using the text entries of the
    dataframe and the previously defined tokenizer.
    
    >> Params
    
    df -> The dataframe containing the text that is to be tokenized
    column -> The column in the dataframe where the text is located
    '''
    

    stop_words = nltk.corpus.stopwords.words('english')
        
    vec = CountVectorizer(tokenizer=textblob_tokenizer, stop_words=stop_words)
    
    matrix = vec.fit_transform(df[column])
    
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    matrix = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out())

    return matrix


In [10]:
def transpose(df):
    '''
    Uses the built-in pandas method to transpose
    the axes of the dataframe passed as a parameter.
    
    >> Params
    
    df -> The dataframe to be transposed
    '''
        
    return df.transpose().reset_index()

In [11]:
def get_sums(df):
    '''
    Counts how many times each word has been used
    in the dataframe. It is a wrapper around the
    built-in pandas df.sum() method.
    
    >> Params
    
    df -> The dataframe with the word counts to be summed.
    '''
        
    # numeric_only=True skips the text column 'index', which holds the words
    df['WORD_SUM'] = df.sum(axis=1, numeric_only=True)
    
    # This value will be the same throughout the whole dataframe
    df['TOTAL_SUM'] = df.WORD_SUM.sum(axis=0)
    
    # Keeps only the columns of interest
    df = df[['index', 'WORD_SUM', 'TOTAL_SUM']]
        
    return df


In [12]:
def get_ratios(df, multiplier=100):
    '''
    Based on the sums obtained by get_sums(df), this function
    calculates the usage ratio of each word.
    
    
    >> Params
    
    df -> The dataframe with word count information
    multiplier -> The number the ratio is multiplied by, so that it reads as
    occurrences per `multiplier` words (for example, per 10,000 words). Default: 100
    '''
        
    df['WORD_RATIO'] = (df['WORD_SUM'] / df['TOTAL_SUM']) * multiplier
    
    return df



In [13]:
def merge_dfs(racy_df, not_racy_df):
    '''
    Merges the dataframe with the tokens and ratios for
    the racy and not racy content into a single dataframe.
    
    
    >> Params
    
    racy_df -> the df on the left of merge, with the
    ratios for each word in images tagged as racy
    
    not_racy_df -> the df on the right, with ratios
    for each word in all remaining images
    '''
    
    suffixes = [ "_RACY", "_NOT_RACY" ]
    
    # Merges both dataframes on the tokens
    df = racy_df.merge(not_racy_df, on='index', how='outer', suffixes=suffixes)
    
    return df

In [14]:
def get_probabilities(df):
    '''
    Calculates the quotient between the usage ratios
    of a given word in content that is marked as racy
    and in content that is not. That is, here we find
    the words that are more associated with racy pics.
    
    >> Params
    
    df -> The df resulting from merge_dfs()   
    '''
        
    df['TIMES_MORE_LIKELY_RACY']  = df["WORD_RATIO_RACY"] / df["WORD_RATIO_NOT_RACY"]
    df['TIMES_MORE_LIKELY_NOT_RACY'] = df["WORD_RATIO_NOT_RACY"] / df["WORD_RATIO_RACY"]

    return df



In [15]:
def calculate(df):
    '''
    Wrapper function that calls all the functions
    defined above and returns a df with the words
    that are more likely to appear in racy content.
    '''
    
    # Creates two different dfs
    racy_df = df[df.racy_bool]
    not_racy_df = df[~df.racy_bool]
    
    racy_df = make_matrix(racy_df, "title")
    not_racy_df = make_matrix(not_racy_df, "title")
        
    racy_df = transpose(racy_df)
    not_racy_df = transpose(not_racy_df)
            
    racy_df = get_sums(racy_df)
    not_racy_df = get_sums(not_racy_df)
    
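    # Keep only words with a minimum frequency (more than 10 occurrences);
    # .copy() avoids pandas' SettingWithCopyWarning when get_ratios() adds a column
    racy_df = racy_df[racy_df.WORD_SUM > 10].copy()
    not_racy_df = not_racy_df[not_racy_df.WORD_SUM > 10].copy()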

    
    racy_df = get_ratios(racy_df, 10000)
    not_racy_df = get_ratios(not_racy_df, 10000)
        
    results = merge_dfs(racy_df, not_racy_df)
    
    results = get_probabilities(results)
    
    # Round
    results = results.round(decimals=5)

    # Removes words that weren't present in both dfs
    results = results.dropna()
    
    return results



In [16]:
%%time
results = calculate(df)

CPU times: user 5 µs, sys: 1 µs, total: 6 µs
Wall time: 9.78 µs


Which words are more likely to be associated with racy pictures?

In [17]:
results.sort_values(by="TIMES_MORE_LIKELY_RACY", ascending=False).head(100)[["index", "TIMES_MORE_LIKELY_RACY"]]

We will build our list of words to look for with some of those, filtering out the words that indicate nationality and expanding it with some variations of those terms. We also left out 'beauty' and 'beautiful' because, although associated with racy pictures, those words were producing lots of false positives.

In [18]:
keywords = [
    "sexy", "hot", "hottest",
    #"beauty", "beautiful",
    "sex", "laid", "fuck",
    "marry", "marriage", "bride", "brides", "wife", "wives", "mail", "order", 
    "dating", "date", "meet", "single",
]

In [19]:
# Adds word boundaries:
keywords = [r"\b" + word + r"\b" for word in keywords]

Now we can mark all the titles that contain such words.

In [20]:
capture = re.compile("|".join(keywords))

In [21]:
df["keywords_bool"] = df.full_title.str.lower().apply(lambda x: bool(capture.search(x)))
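
The word boundaries matter because several keywords are substrings of unrelated words. The sketch below compares a bounded and an unbounded pattern on a few made-up titles; the keyword subset and the titles are chosen only for illustration and are not taken from the dataset.

```python
import re

# A small subset of the keywords, chosen for illustration
sample_keywords = ["hot", "date", "single"]

bounded = re.compile("|".join(r"\b" + word + r"\b" for word in sample_keywords))
unbounded = re.compile("|".join(sample_keywords))

# Made-up titles, not taken from the dataset
titles = ["photo update from the studio", "hot single women", "hotel booking"]

bounded_hits = [bool(bounded.search(title)) for title in titles]
unbounded_hits = [bool(unbounded.search(title)) for title in titles]

print(bounded_hits)    # [False, True, False]
print(unbounded_hits)  # [True, True, True]
```

Without `\b`, "photo", "update" and "hotel" would all be flagged, since they contain "hot" or "date" as substrings.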

In [22]:
# How many entries contain at least one keyword?
df.keywords_bool.value_counts()

False    17792
True      2769
Name: keywords_bool, dtype: int64
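
For reference, the counts above can be turned into a share with simple arithmetic. This is only a sketch that copies the two printed numbers; inside the notebook, `df.keywords_bool.value_counts(normalize=True)` should give the same proportion directly.

```python
# Counts copied from the value_counts() output above
flagged = 2769
not_flagged = 17792

share_flagged = flagged / (flagged + not_flagged)
print(f"{share_flagged:.1%}")  # 13.5%
```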

In [23]:
# Save the output
df.to_csv("../output/dataset/complete-data-classified.csv", index=False)