<a href="https://colab.research.google.com/github/farhadmohmand66/sentiment_analysis_of_subreddits/blob/main/sentiement_analysis_reddit_confession.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Find Juicy Confessions from Subreddits Using Sentiment Analysis

**Initial Requirement:** The primary goal of this notebook was to identify "juicy" confessions from a dataset of Reddit posts. "Juicy" was initially defined subjectively, aiming to find content related to secrets, controversial topics, or emotionally charged situations.

**Steps Taken and Achievements:**

1.  **Data Loading and Cleaning:** We loaded the raw data from a CSV file into a pandas DataFrame (878 rows, 5 columns). Initial exploration with `df.head()` and `df.info()` showed the dataset's structure and revealed missing values in the 'author' column (785 of 878 non-null). A separate `duplicated('title')` check flagged 144 rows that share a title with another row. Dropping duplicate titles while keeping the first occurrence left 806 rows. A custom `clean_text` function was then applied to both the 'title' and 'content' columns; it removes emojis, special symbols, and markdown, and normalizes whitespace while preserving essential punctuation such as apostrophes and parentheses.

2.  **Keyword-Based Filtering:** To make an initial selection of potentially "juicy" content, we defined a list of keywords associated with sensitive or controversial topics. A boolean mask keeps only the rows whose 'content' contains at least one of these keywords as a case-insensitive substring. This step produced `juicy_confessions_df`, a 647-row subset of the data likely to contain relevant confessions.

3.  **Sentiment Analysis Refining:** We further refined the selection by applying sentiment analysis to the cleaned titles of the keyword-filtered confessions. Using the `distilbert-base-uncased-finetuned-sst-2-english` model, we analyzed the sentiment of each `cleaned_title`, obtaining a sentiment label ('POSITIVE' or 'NEGATIVE') and a confidence score. We then created boolean masks to identify confessions with *strong* sentiment (either positive or negative) based on a defined threshold (0.8).

4.  **Final Filtering and Saving:** The final `juicy_confessions_sentiment_df` DataFrame was created by filtering `juicy_confessions_df` with the combined sentiment mask. It contains only those confessions that matched the keyword criteria and whose cleaned title showed strong positive or negative sentiment: 625 of the 647 keyword matches, of which 562 are negative and 63 positive. Finally, this curated DataFrame was saved to `juicy_confessions_sentiment.csv` in Google Drive, using `index=False` to omit the row index and `encoding='utf-8-sig'` so that the file starts with a UTF-8 byte-order mark, which helps spreadsheet tools such as Excel detect the encoding.

**Overall Achievement:** By combining data cleaning, a keyword-based filter, and a sentiment-based filter on the title's emotional intensity, we successfully extracted a refined dataset of confessions from the initial large collection. This final dataset, saved as `juicy_confessions_sentiment.csv`, represents the posts most likely to fit the description of "juicy" according to our defined criteria and the sentiment analysis model's predictions.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import re
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/reddite_confessions.csv')
print(df.head())

  subreddit                                              title  \
0  adultery     I'm cooked, chat🌬️Ventilation💨 (self.adultery)   
1  adultery       Thank you chat🙌✨Good Vibes✨🙌 (self.adultery)   
2  adultery  Why does this hurt so much?😩Donezo🥩 x 🌬️Ventil...   
3  adultery  I miss you so much - after 6 months and after ...   
4  adultery  Tinder video selfie verification?💻Hello IT?📞 (...   

                                             content              author  \
0  I am reeling. It wasn't supposed to be like th...    onmykneesdarling   
1  Two years ago I met an amazing women on AM, we...           Jgords235   
2  I never thought I’d be here…writing something ...  Visible_Fault_6070   
3  Dear You, I miss you. I miss you so much. The ...     Momoparadise619   
4  So I decided to roll up a Tinder account to se...      GentlemanDom72   

                                                 url  
0  https://old.reddit.com/r/adultery/comments/1lg...  
1  https://old.reddit.com/r/adultery

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878 entries, 0 to 877
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  878 non-null    object
 1   title      878 non-null    object
 2   content    878 non-null    object
 3   author     785 non-null    object
 4   url        878 non-null    object
dtypes: object(5)
memory usage: 34.4+ KB


In [6]:
# check duplicate titles
duplicate_titles = df[df.duplicated('title', keep=False)]

if not duplicate_titles.empty:
  print("\nDuplicate titles found:", len(duplicate_titles))
else:
  print("\nNo duplicate titles found.")


Duplicate titles found: 144


In [7]:
df = df.drop_duplicates(subset='title', keep='first')
print("\nDataFrame after dropping duplicate titles:")

# Verify that duplicates have been dropped
duplicate_titles_after_drop = df[df.duplicated('title', keep=False)]
print("\nNumber of duplicate titles after drop:", len(duplicate_titles_after_drop))


DataFrame after dropping duplicate titles:

Number of duplicate titles after drop: 0


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 806 entries, 0 to 877
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  806 non-null    object
 1   title      806 non-null    object
 2   content    806 non-null    object
 3   author     718 non-null    object
 4   url        806 non-null    object
dtypes: object(5)
memory usage: 37.8+ KB
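
The duplicate check and the drop count different things, which is why 144 rows were flagged but only 72 were removed (878 to 806). A minimal sketch on a made-up frame; the `toy` DataFrame and its titles are illustrative and not taken from the dataset:

```python
import pandas as pd

# Hypothetical toy data: title "a" appears three times, "b" twice, "c" once.
toy = pd.DataFrame({"title": ["a", "b", "a", "c", "b", "a"]})

# keep=False marks every member of a duplicated group, including the first one.
flagged = toy[toy.duplicated("title", keep=False)]

# keep='first' retains one row per title and drops only the later repeats.
deduped = toy.drop_duplicates(subset="title", keep="first")

rows_flagged = len(flagged)             # all rows that belong to a duplicated group
rows_dropped = len(toy) - len(deduped)  # only the extra copies
print(rows_flagged, rows_dropped)
```

Note that `drop_duplicates` keeps the original index labels, which is why the notebook's index still runs "0 to 877" after the drop.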


Clean the text while preserving apostrophes, parentheses, and essential punctuation. Remove emojis, special symbols, and markdown formatting, so that the text can be converted to another language easily.

In [15]:
def clean_text(text):
    """
    Clean text while preserving apostrophes, parentheses, and essential punctuation.
    Removes emojis, special symbols, and markdown formatting.
    Adds spaces around removed emojis/symbols to prevent word merging.
    Removes extra dots and normalizes whitespace.
    """
    if not isinstance(text, str) or not text:
        return ""  # Covers None, NaN, and empty strings

    # Step 1: Replace problematic apostrophes and quotes
    text = text.replace("’", "'")  # Replace curly apostrophe with straight
    text = text.replace("‘", "'")  # Replace left single quotation mark
    text = text.replace("“", '"')  # Replace left double quotation mark
    text = text.replace("”", '"')  # Replace right double quotation mark

    # Step 2: Clean up markdown formatting. This runs before symbol removal,
    # because that step strips '~', '>', '[' and ']' and would otherwise leave
    # the strikethrough, blockquote, and placeholder patterns with nothing to match.
    text = re.sub(r'\*{2}(.*?)\*{2}', r'\1', text)  # Remove bold formatting
    text = re.sub(r'\*(.*?)\*', r'\1', text)         # Remove italic formatting
    text = re.sub(r'~~(.*?)~~', r'\1', text)         # Remove strikethrough
    text = re.sub(r'^>.*$', '', text, flags=re.MULTILINE)  # Remove blockquotes

    # Step 3: Remove Reddit-specific placeholders
    text = re.sub(r'\[deleted\]|\[removed\]', '', text, flags=re.IGNORECASE)

    # Step 4: Replace emojis and special symbols with a space to prevent word merging.
    # The regex matches every character that is NOT one of:
    # \w (word characters: Unicode letters, digits, _)
    # \s (whitespace characters)
    # .,!?'"()&$%-–—:;+=@#*/| (specific allowed punctuation and symbols)
    text = re.sub(r'[^\w\s.,!?\'"()&$%\-–—:;+=@#*/\|]', ' ', text)

    # Step 5: Remove extra dots (e.g., ...) and normalize spaces
    text = re.sub(r'\.{2,}', '.', text)  # Replace two or more consecutive dots with a single dot
    text = re.sub(r'\s+', ' ', text).strip() # Normalize whitespace and trim

    return text

In [18]:
# Test the function to see how it cleans sample titles
print(clean_text("I'm cooked 6, chat🌬️Ventilation💨 (self.adultery)"))
# Output: "I'm cooked 6, chat Ventilation (self.adultery)"

print(clean_text("Why does this hurt so much?😩Donezo🥩 x 🌬️Ventil..."))
# Output: "Why does this hurt so much? Donezo x Ventil."

print(clean_text("Tinder video selfie verification?💻Hello IT?📞 (."))
# Output: "Tinder video selfie verification? Hello IT? (."

I'm cooked 6, chat Ventilation (self.adultery)
Why does this hurt so much? Donezo x Ventil.
Tinder video selfie verification? Hello IT? (.
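
The symbol-stripping regex in `clean_text` relies on Python's `\w` being Unicode-aware for `str` patterns, so accented letters survive while emojis and pictographs do not. A small standalone sketch using only `re`; the sample string is made up and written with escapes so the exact code points are unambiguous:

```python
import re

# Same allow-list as the symbol-stripping regex in clean_text.
DISALLOWED = r'[^\w\s.,!?\'"()&$%\-–—:;+=@#*/\|]'

# "café ☕ naïve 😩 ok" with precomposed accented letters.
sample = "caf\u00e9 \u2615 na\u00efve \U0001F629 ok"

stripped = re.sub(DISALLOWED, " ", sample)          # symbols become spaces
normalized = re.sub(r"\s+", " ", stripped).strip()  # collapse repeated spaces
print(normalized)
```

One consequence is that text in non-Latin scripts is kept as well, since those letters also count as word characters.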


In [19]:
df['cleaned_title'] = df['title'].apply(clean_text)

print("\nDataFrame with cleaned titles:")
print(df[['title', 'cleaned_title']].head(2))


DataFrame with cleaned titles:
                                            title  \
0  I'm cooked, chat🌬️Ventilation💨 (self.adultery)   
1    Thank you chat🙌✨Good Vibes✨🙌 (self.adultery)   

                                  cleaned_title  
0  I'm cooked, chat Ventilation (self.adultery)  
1     Thank you chat Good Vibes (self.adultery)  


In [21]:
df['cleaned_content'] = df['content'].apply(clean_text)

In [23]:
# df.head(2)

Goal: find content that deals with "confessions" (any topic, as long as it is "juicy").

In [25]:
# To find content related to "confessions" and potentially "juicy" topics,
# we filter the DataFrame based on keywords in the body of the posts.
# "Juicy" is subjective, so we look for common themes that might be considered juicy
# within confessions, such as infidelity, secrets, controversial actions, etc.

# Define a list of keywords that might indicate a "juicy" confession.
# This is a starting point and can be expanded based on specific interests.
# Note: matching is by substring, so "lie" also matches "believe".
juicy_keywords = [
    "cheat", "affair", "lie", "lied", "steal", "stole", "betray", "secret", "regret",
    "guilt", "shame", "hidden", "confess", "stolen", "deceive", "manipulate", "obsess",
    "addict", "addiction", "obsession", "illegal", "crime", "criminal", "police",
    "murder", "kill", "drug", "alcohol", "sex", "sexual", "relationship",
    "pregnant", "abortion", "divorce", "marriage", "fired", "job", "work", "money"
]

# We can search for these keywords in the 'content' column of the DataFrame.
# We will perform a case-insensitive search.

text_column_for_keywords = 'content'

if text_column_for_keywords in df.columns:
    # Create a boolean mask where any of the juicy keywords are present in the text column
    # We use | (OR) to combine the conditions for each keyword
    keyword_mask = df[text_column_for_keywords].str.contains('|'.join(juicy_keywords), case=False, na=False)

    # Filter the DataFrame using the mask
    juicy_confessions_df = df[keyword_mask].copy() # Use .copy() to avoid SettingWithCopyWarning

    print(f"\nFound {len(juicy_confessions_df)} potential juicy confessions based on keywords in the '{text_column_for_keywords}' column.")

    # Display the first few rows of the potentially juicy confessions
    print(juicy_confessions_df[['title', text_column_for_keywords]].head())

    # You can further inspect the contents of juicy_confessions_df
    # For example, to see the full text of a confession:
    # for index, row in juicy_confessions_df.head().iterrows():
    #    print("\n--- Title ---")
    #    print(row['title'])
    #    if text_column_for_keywords in row:
    #        print("\n--- Body ---")
    #        print(row[text_column_for_keywords])
    #    print("-" * 20)
else:
    print(f"\nWarning: Text column '{text_column_for_keywords}' not found in the DataFrame.")


Found 647 potential juicy confessions based on keywords in the 'content' column.
                                               title  \
0     I'm cooked, chat🌬️Ventilation💨 (self.adultery)   
1       Thank you chat🙌✨Good Vibes✨🙌 (self.adultery)   
3  I miss you so much - after 6 months and after ...   
4  Tinder video selfie verification?💻Hello IT?📞 (...   
5  AP is buying a house, mixed emotions🌬️Ventilat...   

                                             content  
0  I am reeling. It wasn't supposed to be like th...  
1  Two years ago I met an amazing women on AM, we...  
3  Dear You, I miss you. I miss you so much. The ...  
4  So I decided to roll up a Tinder account to se...  
5  I have been with my AP for more than 6 years n...  
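
Because `str.contains` with a joined pattern matches substrings, short keywords can fire inside unrelated words (for example "lie" inside "believe"). Below is a hedged sketch of one alternative, whole-word matching with `\b`. The `sample` texts and the reduced keyword list are made up for illustration, and whether stricter matching suits this dataset is a judgment call, since it also stops "cheat" from matching "cheated":

```python
import re

keywords = ["lie", "cheat", "kill"]  # reduced, illustrative subset

# Substring pattern, equivalent to '|'.join(keywords) with case-insensitive matching.
alternation = "|".join(map(re.escape, keywords))
substring_re = re.compile(alternation, re.IGNORECASE)

# Whole-word pattern: \b requires a word boundary on both sides of the keyword.
whole_word_re = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)

sample = ["I believe in my skills", "I told a lie", "He cheated on me"]

substring_hits = [bool(substring_re.search(t)) for t in sample]
whole_word_hits = [bool(whole_word_re.search(t)) for t in sample]
print(substring_hits)
print(whole_word_hits)
```

In pandas, the same idea could be applied by passing `whole_word_re.pattern` to `str.contains(..., case=False, na=False)`. Listing inflected forms explicitly (as the keyword list already does with "lie" and "lied") would then matter more.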


In [26]:
juicy_confessions_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 647 entries, 0 to 877
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   subreddit        647 non-null    object
 1   title            647 non-null    object
 2   content          647 non-null    object
 3   author           583 non-null    object
 4   url              647 non-null    object
 5   cleaned_title    647 non-null    object
 6   cleaned_content  647 non-null    object
dtypes: object(7)
memory usage: 40.4+ KB


In [28]:
# Sort by the current index (which are the original indices from the full DataFrame)
juicy_confessions_df = juicy_confessions_df.sort_index()

# Reset the index to a new sequential index starting from 0
juicy_confessions_df = juicy_confessions_df.reset_index(drop=True)

print("\nDataFrame after sorting and resetting index:")
print(juicy_confessions_df.info())


DataFrame after sorting and resetting index:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 647 entries, 0 to 646
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   subreddit        647 non-null    object
 1   title            647 non-null    object
 2   content          647 non-null    object
 3   author           583 non-null    object
 4   url              647 non-null    object
 5   cleaned_title    647 non-null    object
 6   cleaned_content  647 non-null    object
dtypes: object(7)
memory usage: 35.5+ KB
None


In [41]:
# !pip install transformers

Apply sentiment analysis to `juicy_confessions_df['cleaned_title']` to find juicy content.

### How the distilbert-base-uncased-finetuned-sst-2-english model produces labels and scores

When used for sentiment analysis, the model outputs a probability distribution over the possible sentiment labels (in this case, 'POSITIVE' and 'NEGATIVE').

Here's how it works:

1.  **Processing the Text:** The model takes the input text (like a cleaned title or content) and processes it through its layers.
2.  **Output Layer:** The final layer of the model is configured for classification and outputs a value for each possible label.
3.  **Softmax Activation:** A softmax function is applied to these output values. Softmax converts the values into probabilities that sum to 1. For binary classification, you get two probabilities: one for 'POSITIVE' and one for 'NEGATIVE'.
4.  **Assigning Label and Score:** The `sentiment_label` assigned to a text is the label ('POSITIVE' or 'NEGATIVE') with the higher probability. The `sentiment_score` is the probability associated with the assigned label. For example, if the model outputs a probability of 0.95 for 'NEGATIVE' and 0.05 for 'POSITIVE', the `sentiment_label` will be 'NEGATIVE' and the `sentiment_score` will be 0.95.

So, a score closer to 1 indicates higher confidence in the assigned sentiment label. Because the score belongs to the winning label, it cannot fall below 0.5 for this binary model; a score around 0.5 suggests the model is uncertain between positive and negative sentiment.

**`juicy_sentiment_mask_negative`:** This mask identifies rows in the `juicy_confessions_df` DataFrame that meet two conditions:

- `juicy_confessions_df['sentiment_label'] == 'NEGATIVE'` checks whether the value in the 'sentiment_label' column for each row equals the string 'NEGATIVE'.
- `juicy_confessions_df['sentiment_score'] > sentiment_threshold` checks whether the value in the 'sentiment_score' column for each row is greater than the value stored in the `sentiment_threshold` variable.

The `&` symbol between the two conditions means that both must be True for a row to be True in `juicy_sentiment_mask_negative`. In other words, the mask is True for rows where the sentiment analysis classified the text as 'NEGATIVE' and the confidence score for that label is above the specified threshold. This helps in filtering for confessions that the model confidently rates as negative.
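
The label-and-score step described above can be sketched in plain Python. The logits below are made-up numbers rather than real model output, and the helper name `softmax` and the label order are assumptions for illustration:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

labels = ["NEGATIVE", "POSITIVE"]  # assumed label order, for illustration only
logits = [2.0, -1.0]               # hypothetical output-layer values

probs = softmax(logits)

# The assigned label is the one with the highest probability;
# the score is that probability.
best = max(range(len(labels)), key=lambda i: probs[i])
sentiment_label = labels[best]
sentiment_score = probs[best]

print(sentiment_label, round(sentiment_score, 4))
```

With two labels the winning probability is at least 0.5, which is why a threshold such as 0.8 only separates "fairly sure" from "nearly undecided" predictions.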

In [30]:
from transformers import pipeline
sentiment_analyzer = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

# Apply sentiment analysis to the 'cleaned_title' column
# The pipeline returns a list of dictionaries, each containing 'label' (POSITIVE/NEGATIVE) and 'score'
sentiment_results = sentiment_analyzer(juicy_confessions_df['cleaned_title'].tolist())

# Extract labels and scores into new columns
juicy_confessions_df['sentiment_label'] = [res['label'] for res in sentiment_results]
juicy_confessions_df['sentiment_score'] = [res['score'] for res in sentiment_results]

# Now you can filter based on sentiment.
# "Juicy" content might often have a strong negative sentiment (regret, shame, etc.)
# or sometimes a strong positive sentiment (excitement, perhaps?).
# This is subjective, but we can start by looking for strong sentiment scores.

# Let's define a threshold for strong sentiment. For binary sentiment, a score close to 1 is strong.
sentiment_threshold = 0.8 # Adjust this based on experimentation

# juicy_sentiment_mask_negative: a boolean pandas Series that is True for rows where
# the cleaned title was labeled 'NEGATIVE' with a sentiment score greater than
# sentiment_threshold, i.e. confessions the model confidently rates as negative.
# juicy_sentiment_mask_positive: the same idea for the 'POSITIVE' label, flagging
# confessions the model confidently rates as positive.

juicy_sentiment_mask_negative = (juicy_confessions_df['sentiment_label'] == 'NEGATIVE') & (juicy_confessions_df['sentiment_score'] > sentiment_threshold)

# Or filter for positive sentiment with a high score (e.g., exciting confession)
juicy_sentiment_mask_positive = (juicy_confessions_df['sentiment_label'] == 'POSITIVE') & (juicy_confessions_df['sentiment_score'] > sentiment_threshold)




Device set to use cpu


The next cell creates a new boolean mask called `juicy_sentiment_mask` by combining `juicy_sentiment_mask_negative` and `juicy_sentiment_mask_positive` with the `|` operator.

Here's what that means:

- On boolean Series, the `|` operator performs an element-wise logical OR between the two masks.
- The resulting `juicy_sentiment_mask` is True for a row if either `juicy_sentiment_mask_negative` or `juicy_sentiment_mask_positive` is True for that row. Because each title receives exactly one label, the two masks are never both True for the same row.

In [32]:
# Combine masks if you want both strongly positive and strongly negative
juicy_sentiment_mask = juicy_sentiment_mask_negative | juicy_sentiment_mask_positive

In [37]:
print("Count of True:", juicy_sentiment_mask.sum())
print("Count of False:", (~juicy_sentiment_mask).sum())

Count of True: 625
Count of False: 22
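
Since every row carries exactly one of the two labels, the combined mask reduces to a plain score threshold. A sketch on a made-up frame; the `toy` rows and the name `score_only` are illustrative only:

```python
import pandas as pd

toy = pd.DataFrame({
    "sentiment_label": ["NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE"],
    "sentiment_score": [0.99, 0.95, 0.62, 0.80],
})
threshold = 0.8

neg = (toy["sentiment_label"] == "NEGATIVE") & (toy["sentiment_score"] > threshold)
pos = (toy["sentiment_label"] == "POSITIVE") & (toy["sentiment_score"] > threshold)
combined = neg | pos

# With only two possible labels, OR-ing both masks equals the score test alone.
score_only = toy["sentiment_score"] > threshold

# True counts as 1 when summing a boolean Series.
n_true = int(combined.sum())
n_false = int((~combined).sum())
print(combined.tolist(), n_true, n_false)
```

The comparison is strict, so a score of exactly 0.8 does not pass. In the notebook's own run only 22 of 647 titles fall at or below the threshold, so this filter removes relatively little.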


In [34]:
juicy_confessions_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 647 entries, 0 to 646
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   subreddit        647 non-null    object 
 1   title            647 non-null    object 
 2   content          647 non-null    object 
 3   author           583 non-null    object 
 4   url              647 non-null    object 
 5   cleaned_title    647 non-null    object 
 6   cleaned_content  647 non-null    object 
 7   sentiment_label  647 non-null    object 
 8   sentiment_score  647 non-null    float64
dtypes: float64(1), object(8)
memory usage: 45.6+ KB


In [38]:
# juicy_confessions_df.head(3)

The next cell filters `juicy_confessions_df` with `juicy_sentiment_mask` and stores the result in a new DataFrame called `juicy_confessions_sentiment_df`.

Here's a breakdown:

`juicy_confessions_df[juicy_sentiment_mask]`: This uses boolean indexing (also known as boolean selection) in pandas. The `juicy_sentiment_mask` is a Series of boolean values (True or False) with the same index as `juicy_confessions_df`. When you index a DataFrame with a boolean Series, pandas returns only the rows where the Series is True. In this case, it selects the rows of `juicy_confessions_df` with strong positive or negative sentiment.

In [39]:
# Filter the DataFrame based on sentiment mask
juicy_confessions_sentiment_df = juicy_confessions_df[juicy_sentiment_mask].copy()

print(f"\nFound {len(juicy_confessions_sentiment_df)} potentially juicy confessions based on sentiment.")

# Display the first few rows of the potentially juicy confessions identified by sentiment
print(juicy_confessions_sentiment_df[['title', 'cleaned_title', 'sentiment_label', 'sentiment_score']].head())


Found 625 potentially juicy confessions based on sentiment.
                                               title  \
0     I'm cooked, chat🌬️Ventilation💨 (self.adultery)   
1       Thank you chat🙌✨Good Vibes✨🙌 (self.adultery)   
2  I miss you so much - after 6 months and after ...   
3  Tinder video selfie verification?💻Hello IT?📞 (...   
4  AP is buying a house, mixed emotions🌬️Ventilat...   

                                       cleaned_title sentiment_label  \
0       I'm cooked, chat Ventilation (self.adultery)        NEGATIVE   
1          Thank you chat Good Vibes (self.adultery)        POSITIVE   
2  I miss you so much - after 6 months and after ...        NEGATIVE   
3  Tinder video selfie verification? Hello IT? (s...        NEGATIVE   
4  AP is buying a house, mixed emotions Ventilati...        NEGATIVE   

   sentiment_score  
0         0.988388  
1         0.999230  
2         0.986825  
3         0.997121  
4         0.989145  


In [40]:
# Count the number of negative and positive labels
negative_count = juicy_confessions_sentiment_df[juicy_confessions_sentiment_df['sentiment_label'] == 'NEGATIVE'].shape[0]
positive_count = juicy_confessions_sentiment_df[juicy_confessions_sentiment_df['sentiment_label'] == 'POSITIVE'].shape[0]

print(f"\nNumber of negative confessions (with strong sentiment): {negative_count}")
print(f"Number of positive confessions (with strong sentiment): {positive_count}")


Number of negative confessions (with strong sentiment): 562
Number of positive confessions (with strong sentiment): 63
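
The same tally can be obtained in one call with `Series.value_counts()`. A sketch on made-up labels, not the notebook's data:

```python
import pandas as pd

labels = pd.Series(
    ["NEGATIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"],
    name="sentiment_label",
)

counts = labels.value_counts()                # rows per label, most frequent first
shares = labels.value_counts(normalize=True)  # same, as fractions of the total

print(counts.to_dict())
print(shares.round(2).to_dict())
```

Applied to the notebook's result, the shares would make the imbalance explicit: 562 of 625 rows, roughly 90%, are labeled negative.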


In [42]:
output_path = '/content/drive/MyDrive/Colab Notebooks/juicy_confessions_sentiment.csv'

# Save the DataFrame to a CSV file
juicy_confessions_sentiment_df.to_csv(output_path, index=False, encoding='utf-8-sig')
print(f"\nSuccessfully saved the juicy confessions sentiment DataFrame to: {output_path}")


Successfully saved the juicy confessions sentiment DataFrame to: /content/drive/MyDrive/Colab Notebooks/juicy_confessions_sentiment.csv
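
To check what `encoding='utf-8-sig'` does to the file, the following sketch writes a made-up one-row frame to a temporary directory and inspects the first bytes. The file name `toy.csv` and the `toy` data are hypothetical:

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"title": ["I\u2019m cooked"], "sentiment_label": ["NEGATIVE"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")  # hypothetical file name
    toy.to_csv(path, index=False, encoding="utf-8-sig")

    # utf-8-sig prepends the three-byte UTF-8 byte-order mark (BOM).
    with open(path, "rb") as fh:
        has_bom = fh.read(3) == b"\xef\xbb\xbf"

    # Reading with the same encoding strips the BOM from the first header name.
    restored = pd.read_csv(path, encoding="utf-8-sig")

print(has_bom, list(restored.columns))
```

Reading the saved file back with `encoding='utf-8-sig'` avoids a stray BOM character at the start of the first column name.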
