# %% [markdown]
# # Manual Cleaning of Tweets Dataset
# 
# In this notebook, we'll remove tweets that contain explicit or "dirty" words. You can modify the list of banned words as needed.
# 
# **Steps:**
# 1. Load the dataset (assumed to be `tweets_master.csv`).
# 2. Define a set of banned words.
# 3. Create a function to flag tweets containing any banned word.
# 4. Filter out those tweets.
# 5. Save the cleaned dataset as a new CSV file.

# %% [code]
import pandas as pd
import re

# Define a set of banned/dirty words (modify this list as needed)
dirty_words = {
    "sex", "sexual", "nude", "explicit", "porn", "brazzers",
    "fuck", "fucking", "ass", "bitch", "slut", "freaks", "swingers"
}

# %% [markdown]
# ## Function to Check for Dirty Words
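# This function lowercases the text and checks whether any banned word appears anywhere in it as a substring.
# Substring matching is aggressive: a banned word such as "ass" also matches inside "class" or "passed".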

# %% [code]
def is_dirty_tweet(text, dirty_words_set):
    """
    Return True if any banned word appears as a substring of the text.

    Non-string values (e.g. NaN from missing cells) are treated as clean.
    """
    if not isinstance(text, str):
        return False
    text_lower = text.lower()
    # Option A (active): substring match against the entire text.
    # Aggressive: "ass" also matches inside "class" and "passed".
    return any(banned_word in text_lower for banned_word in dirty_words_set)

    # Option B (disabled): token-wise exact match, which is less aggressive.
    # To use it, replace the return statement above with:
    # tokens = re.split(r'\W+', text_lower)
    # return any(token in dirty_words_set for token in tokens)
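
# %% [markdown]
# ## Optional: Whole-Word Matching
# Substring matching flags a tweet whenever a banned word occurs inside a longer word.
# The cell below is a minimal sketch of a stricter alternative that only matches whole words,
# using regex word boundaries. `build_banned_pattern`, `is_dirty_whole_word`, and `example_banned`
# are hypothetical names introduced here for illustration; they are not part of the original notebook.
# The trade-off is that whole-word matching will miss banned words embedded in hashtags or
# run-together text (e.g. "#explicitcontent"), so which variant fits depends on the data.

```python
# %% [code]
import re


def build_banned_pattern(banned_words):
    """Compile a case-insensitive regex that matches any banned word as a whole word."""
    if not banned_words:
        # An empty negative lookahead never matches, so nothing gets flagged.
        return re.compile(r"(?!)")
    # Escape each word so regex metacharacters are treated literally.
    alternatives = "|".join(re.escape(word) for word in sorted(banned_words))
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


def is_dirty_whole_word(text, pattern):
    """Return True if the compiled pattern finds a banned whole word in text."""
    if not isinstance(text, str):
        return False
    return pattern.search(text) is not None


# Small illustrative word set (an assumption, not the notebook's full list).
example_banned = {"ass", "explicit"}
example_pattern = build_banned_pattern(example_banned)

example_text = "I passed my class"
# Substring check, as in Option A: "ass" is found inside "passed" and "class".
substring_hit = any(word in example_text.lower() for word in example_banned)
# Whole-word check: neither "passed" nor "class" is the word "ass".
whole_word_hit = is_dirty_whole_word(example_text, example_pattern)

print("Substring match:", substring_hit)
print("Whole-word match:", whole_word_hit)
```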


# %% [markdown]
# ## Load the Dataset
# We load our master CSV file containing all tweets.

# %% [code]
# Update the path if needed
input_csv = "../data/processed/tweets_master.csv"
df = pd.read_csv(input_csv)
print("Initial number of tweets:", len(df))

# %% [markdown]
# ## Filter Out Dirty Tweets
# We apply the filter and create a new DataFrame with only clean tweets.

# %% [code]
df_clean = df[~df["text"].apply(lambda x: is_dirty_tweet(x, dirty_words))]
print("Number of tweets after filtering dirty content:", len(df_clean))
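
# %% [markdown]
# ## Optional: Review the Removed Tweets
# Because substring matching can produce false positives, it may help to look at what was dropped
# before saving. The cell below is a rough sketch of one way to do that: it records which banned
# words triggered each removal. `sample`, `banned`, and `matched_words` are hypothetical names
# introduced here, and the sample rows are made up for illustration rather than taken from
# `tweets_master.csv`. To audit the real data, the same function could be applied to `df["text"]`
# with `dirty_words`.

```python
# %% [code]
import pandas as pd

# Made-up rows standing in for the real dataset (assumption for illustration).
sample = pd.DataFrame(
    {"text": ["I passed my class today", "this is explicit content", None, "Nice weather"]}
)
banned = {"ass", "explicit"}


def matched_words(text, banned_words):
    """Return the sorted banned words that occur as substrings of text."""
    if not isinstance(text, str):
        return []
    lowered = text.lower()
    return sorted(word for word in banned_words if word in lowered)


# One list of matched words per row; an empty list means the row is clean.
hits = sample["text"].apply(lambda t: matched_words(t, banned))
is_dirty = hits.apply(bool)

# Removed rows, annotated with the words that triggered the removal.
removed = sample[is_dirty].assign(matched=hits[is_dirty])
kept = sample[~is_dirty]

print(removed)
print("Kept rows:", len(kept))
```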

# %% [markdown]
# ## Optional: Inspect the Cleaned Data
# Let's view a few rows of the cleaned DataFrame.

# %% [code]
df_clean.head()

# %% [markdown]
# ## Save the Cleaned Dataset
# We save the cleaned tweets to a new CSV file.

# %% [code]
output_csv = "../data/processed/tweets_master_clean.csv"
df_clean.to_csv(output_csv, index=False)
print("Cleaned dataset saved to:", output_csv)


Initial number of tweets: 100
Number of tweets after filtering dirty content: 91
Cleaned dataset saved to: ../data/processed/tweets_master_clean.csv
