# Sentiment Analysis on HappyDB Dataset

## Overview
This notebook analyzes sentiment-bearing words in the HappyDB dataset, comparing how often male and female participants use them in their happy moments at different ages.

## Research Questions
1. How do men and women express happiness differently?
2. Do sentiment words change with age?
3. What are the most common emotional words in happy moments?
4. Are there gender-specific patterns in emotional expression?

## Workflow
1. **Setup & Imports** - Load required libraries
2. **Data Loading & Preprocessing** - Load and clean datasets
3. **Text Processing Functions** - Define preprocessing and analysis functions
4. **Sentiment Analysis** - Identify sentiment words using the VADER lexicon
5. **Gender-based Analysis** - Compare male vs female sentiment patterns
6. **Visualization** - Generate comparison charts
7. **Results & Insights** - Analyze findings

## 1. Setup & Imports

Load all required libraries for data analysis, natural language processing, and visualization.

In [11]:
# Data manipulation and analysis
import pandas as pd
import numpy as np
from collections import Counter
import os

# Natural Language Processing
import nltk
import gensim
from gensim import corpora
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.sentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

# Machine Learning
from sklearn.feature_extraction.text import TfidfVectorizer
import statsmodels.api as sm

# Visualization
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Download required NLTK data
nltk.download('vader_lexicon', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)

print("✅ All libraries imported successfully!")

✅ All libraries imported successfully!


## 2. Data Loading & Preprocessing

Load the HappyDB dataset and demographic information, then merge and filter the data for analysis.


In [13]:
# Load datasets
print("📊 Loading datasets...")
demographic_df = pd.read_csv('./data/demographic.csv')
cleaned_hm_df = pd.read_csv('./data/cleaned_hm.csv')

print(f"Demographic data shape: {demographic_df.shape}")
print(f"Happy moments data shape: {cleaned_hm_df.shape}")

# Merge datasets
merged_data = pd.merge(cleaned_hm_df, demographic_df, on='wid', how='inner')
print(f"Merged data shape: {merged_data.shape}")

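# Filter for US participants only
# Note: each row is one happy moment, so the counts printed below are numbers of
# moments contributed by each group, not numbers of unique participants
us_data = merged_data[merged_data['country'] == 'USA'].copy()
print(f"US participants: {len(us_data)}")

# Split by gender
male_data = us_data[us_data['gender'] == 'm'].copy()
female_data = us_data[us_data['gender'] == 'f'].copy()

print(f"Male participants: {len(male_data)}")
print(f"Female participants: {len(female_data)}")

# Save processed data (create the output directory first if it does not exist)
os.makedirs('./output', exist_ok=True)
male_data.to_csv('./output/male_data.csv', index=False)
female_data.to_csv('./output/female_data.csv', index=False)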

print("✅ Data loading and preprocessing completed!")


📊 Loading datasets...
Demographic data shape: (10844, 6)
Happy moments data shape: (100535, 9)
Merged data shape: (100535, 14)
US participants: 79063
Male participants: 44259
Female participants: 34100
✅ Data loading and preprocessing completed!


## 3. Text Processing Functions

Define all necessary functions for text preprocessing, sentiment analysis, and data processing.


In [14]:
# =============================================================================
# TEXT PREPROCESSING FUNCTIONS
# =============================================================================

def simple_preprocess(text):
    """
    Perform simple preprocessing on the given text.
    - Convert text to lowercase.
    - Remove non-alphabetic characters, keeping only letters and spaces.
    - Split text into individual words.
    
    Parameters:
    - text (str): The text to be preprocessed.
    
    Returns:
    - list: A list of preprocessed words from the text.
    """
    text = text.lower()  # Convert text into lowercase
    text = ''.join(char for char in text if char.isalpha() or char.isspace())  # Remove non-alphabetic characters
    words = text.split()  # Split text into words
    return words

print("✅ Text preprocessing function defined!")

def process_data(data_path, age_column, text_column, gender_prefix):  # Process gender-specific data for analysis
    """
    Process gender-specific data including text preprocessing, sentiment analysis, and age normalization.

    Parameters:
    - data_path (str): Path to the CSV file containing the data.
    - age_column (str): Name of the column containing age information.
    - text_column (str): Name of the column containing text data to be processed.
    - gender_prefix (str): Prefix to distinguish between male and female data.

    Returns:
    - Tuple of processed results: data_df, most_common_filtered_words, sentiment_word_freq_by_valid_age
    """

    # data loading
    data_df = pd.read_csv(data_path)
    data_df[f'{gender_prefix}_processed_text'] = data_df[text_column].apply(simple_preprocess) # Text Preprocessing by converting text to lowercase, removing non-alphabetic characters, and tokenizing

    # Calculate word frequency and identify most common words
    all_words = [word for text in data_df[f'{gender_prefix}_processed_text'] for word in text]
    english_stopwords = set(stopwords.words('english'))
    filtered_words = [word for word in all_words if word not in english_stopwords]
    filtered_word_freq = Counter(filtered_words)
    most_common_filtered_words = filtered_word_freq.most_common(20)

    # Instantiate VADER SentimentIntensityAnalyzer: focus on sentiment words to understand emotional tone of text.
    sia = SentimentIntensityAnalyzer()
    vader_lexicon = sia.lexicon
    sentiment_words_from_filtered = [word for word in filtered_words if word in vader_lexicon]
    unique_sentiment_words = list(set(sentiment_words_from_filtered))

    # Applying age normalization
    data_df[f'normalized_age_{gender_prefix}'] = data_df[age_column].apply(normalize_age)
    valid_age_data_df = data_df.dropna(subset=[f'normalized_age_{gender_prefix}'])
    grouped_texts = valid_age_data_df.groupby(f'normalized_age_{gender_prefix}')[f'{gender_prefix}_processed_text']
    age_grouped_text_valid = grouped_texts.apply(lambda texts: ' '.join(' '.join(text) for text in texts))

    # Word frequency analysis with valid age groups
    word_freq_by_valid_age = {}

    for age, text in age_grouped_text_valid.items():
        words = text.split()
        filtered_words = [word for word in words if word not in english_stopwords]
        word_freq = Counter(filtered_words)
        word_freq_by_valid_age[age] = word_freq

    # Analyzing sentiment word frequency by age
    sentiment_word_freq_by_valid_age = pd.DataFrame({
        word: [word_freq_by_valid_age[age][word] for age in word_freq_by_valid_age] for word in unique_sentiment_words
    }, index=word_freq_by_valid_age.keys())

    return data_df, most_common_filtered_words, sentiment_word_freq_by_valid_age

# Ensure stopwords and VADER's lexicon are available
nltk.download('stopwords')
nltk.download('vader_lexicon')

# Age Normalization
def normalize_age(age):
    """
    Function to normalize age values.
    Returns None for non-numeric or invalid age values.
    """
    try:
        return int(float(age))
    except (ValueError, TypeError):
        return None

# Combine age normalization and basic stopword filtering
def process_data_for_age_with_basic_stopwords_corrected(df, text_column, age_column):
    """
    Calculate word frequencies by age, excluding basic English stopwords, and return the
    top 10 words per age as (word, percentage) tuples.
    Rows whose age cannot be normalized are excluded from the analysis.
    """
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    df['processed_text'] = df[text_column].apply(simple_preprocess)
    # normalize_age returns None for invalid values; a str(x).isdigit() check would
    # also reject numeric values stored as floats (e.g. 25.0)
    df['normalized_age'] = df[age_column].apply(normalize_age)

    # Drop rows with missing ages
    valid_data_df = df.dropna(subset=['normalized_age'])

    # Aggregate texts by age: flatten the per-row token lists into one list per age
    aggregated_texts_by_age = valid_data_df.groupby('normalized_age')['processed_text'].apply(
        lambda token_lists: [word for tokens in token_lists for word in tokens]
    )

    # Calculate word frequencies and percentages
    english_stopwords = set(stopwords.words('english'))  # Build the stopword set once
    word_freq_percentage_by_age = {}
    for age, texts in aggregated_texts_by_age.items():
        filtered_words = [word for word in texts if word not in english_stopwords]
        word_freq = Counter(filtered_words)
        total_words = sum(word_freq.values())
        word_freq_percentage = {word: (count / total_words) * 100 for word, count in word_freq.items()}
        # Sort words by frequency percentage and get top 10
        top_10_words = sorted(word_freq_percentage.items(), key=lambda x: x[1], reverse=True)[:10]
        word_freq_percentage_by_age[age] = top_10_words

    return word_freq_percentage_by_age

# Extract the value of a target word for each age
def extract_word_values(word_freq_data, target_word):
    """
    Extracts and returns the values associated with the target word for each age group
    in the provided data structure.

    Parameters:
    - word_freq_data: A dictionary with age groups as keys and lists of (word, value) tuples as values.
    - target_word: The word for which values are to be extracted across all age groups.

    Returns:
    - results_df: A pandas DataFrame with two columns: 'Age' and '{target_word} Value', where each row corresponds
      to an age group and its value for the target word. If the target word is not present, the value will be None.
    """
    # Initialize an empty dictionary to store the results
    results = {}
    
    # Iterate over each age group in the data
    for age, word_values in word_freq_data.items():
        # Initialize the value for the target word as None for each age group
        value_for_target_word = None
        
        # Search for the target word entry
        for word, value in word_values:
            if word == target_word:
                value_for_target_word = value
                break  # Stop searching once the target word is found
        
        # Assign the found value or None to the results dictionary
        results[age] = value_for_target_word

    # Convert the dictionary to a DataFrame and dynamically name the 'Value' column based on the target word
    results_df = pd.DataFrame(list(results.items()), columns=['Age', f'{target_word} Value'])

    # Return the DataFrame
    return results_df

def merge_word_values_by_age(word_freq_data, words):
    """
    Merges the values for a list of words across age groups into a single DataFrame.

    Parameters:
    - word_freq_data: A dictionary with age groups as keys and lists of (word, value) tuples as values.
    - words: A list of words to extract and merge values for.

    Returns:
    - merged_results_df: A pandas DataFrame with age groups as rows and each word's values as columns.
    """
    # Initialize an empty DataFrame to hold the merged results
    merged_results_df = pd.DataFrame()

    # Iterate over each word to extract its values and merge the results
    for word in words:
        # Extract values for the current word
        results_df = extract_word_values(word_freq_data, word)
        
        # Rename the column to reflect the current word's values
        results_df.rename(columns={f'{word} Value': f'{word}_value'}, inplace=True)
        
        # If it's the first word, initialize the merged DataFrame with the age and word's value
        if merged_results_df.empty:
            merged_results_df = results_df
        else:
            # For subsequent words, merge on 'Age' to ensure alignment across age groups
            merged_results_df = pd.merge(merged_results_df, results_df, on='Age', how='outer')

    # Return the final merged DataFrame
    return merged_results_df


# visualize the frequency of a specific word across different ages for male and female participants
def plot_word_frequency(word, merged_df, output_directory = '../output/word_frequency_compare/'):

    # Check if the word columns exist in the DataFrame, and if not, initialize them to 0
    if f'{word}_male' not in merged_df.columns:
        merged_df[f'{word}_male'] = 0
    if f'{word}_female' not in merged_df.columns:
        merged_df[f'{word}_female'] = 0
        
    # Filling NaN values with 0 for plotting purposes
    plot_data = merged_df[['Age', f'{word}_male', f'{word}_female']].fillna(0)
    
    # Setting the figure size for better visibility, making it wider
    plt.figure(figsize=(14, 6))  # Increased width for a wider chart
    
    # Creating the bar chart
    # Setting the position of bars on the X-axis
    bar_width = 0.35
    r1 = range(len(plot_data))
    r2 = [x + bar_width for x in r1]
    
    # Making the plot
    plt.bar(r1, plot_data[f'{word}_male'], color='b', width=bar_width, edgecolor='grey', label='Male')
    plt.bar(r2, plot_data[f'{word}_female'], color='r', width=bar_width, edgecolor='grey', label='Female')
    
    # Adding labels and title
    plt.xlabel('Age', fontweight='bold')
    plt.xticks([r + bar_width/2 for r in range(len(plot_data))], plot_data['Age'], rotation='vertical')  # Rotate x-axis labels vertically
    plt.ylabel(f'{word.capitalize()} Frequency', fontweight='bold')
    plt.title(f'Comparison of "{word.capitalize()}" Frequency by Age and Gender', fontweight='bold')  # Updated title
    
    # Creating legend & showing the plot
    plt.legend()
    plt.tight_layout()  # Adjust layout to not cut off labels

    if not os.path.exists(output_directory):
        os.makedirs(output_directory)
            
    # Define the output path for saving the graph
    output_path = os.path.join(output_directory, f'{word}.jpg')
    
    # Save the plot to the specified output path
    plt.savefig(output_path)
    plt.close()  # Close the plot to avoid displaying it

✅ Text preprocessing function defined!


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jacksonzhao/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/jacksonzhao/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
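
To make the behaviour of `simple_preprocess` concrete, here is a small sketch on a made-up sentence. The sample text and the `TOY_STOPWORDS` set are assumptions for illustration only; the set is a tiny stand-in for `stopwords.words('english')` so the snippet runs without NLTK data.

```python
from collections import Counter

def simple_preprocess(text):
    """Lowercase, keep only letters and whitespace, then split on whitespace."""
    text = text.lower()
    text = ''.join(ch for ch in text if ch.isalpha() or ch.isspace())
    return text.split()

# Hypothetical mini stopword list (stand-in for NLTK's English stopwords).
TOY_STOPWORDS = {"i", "my", "a", "the", "and", "was", "with"}

sample = "I was SO happy: my son's 1st birthday was a great day!"
tokens = simple_preprocess(sample)

# Drop stopwords, then count the remaining words.
content_words = [w for w in tokens if w not in TOY_STOPWORDS]
word_counts = Counter(content_words)

print(tokens)
print(word_counts.most_common(3))
```

Because punctuation and digits are deleted rather than replaced by spaces, "son's" becomes "sons" and "1st" becomes "st". Tokens like these can still end up in the frequency counts, which is worth keeping in mind when reading the most-common-word lists.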


### Age Normalization Function


In [15]:
def normalize_age(age):
    """
    Normalize age values by converting to integers and handling edge cases.
    
    Parameters:
    - age: Age value (can be string or numeric)
    
    Returns:
    - int or None: Normalized age or None if invalid
    """
    try:
        age_int = int(age)
        if age_int < 0 or age_int > 120:  # Reasonable age range
            return None
        return age_int
    except (ValueError, TypeError):
        return None

print("✅ Age normalization function defined!")


✅ Age normalization function defined!
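
The sketch below probes how this `normalize_age` treats a few edge cases. The sample inputs are invented for illustration and are not taken from the dataset.

```python
def normalize_age(age):
    """Same logic as the cell above: int() conversion plus a 0-120 range check."""
    try:
        age_int = int(age)
        if age_int < 0 or age_int > 120:
            return None
        return age_int
    except (ValueError, TypeError):
        return None

# Made-up inputs covering strings, floats, missing values, and out-of-range ages.
samples = ["25", 25.9, "25.0", "prefer not to say", None, 130, float("nan")]
results = [(sample, normalize_age(sample)) for sample in samples]

for sample, normalized in results:
    print(f"{sample!r:>22} -> {normalized!r}")
```

Two behaviours stand out: a float such as `25.9` is truncated to `25`, while the string `"25.0"` is rejected because `int()` does not parse decimal strings. If the age column happens to contain strings of that form, converting with `int(float(age))` (as the earlier definition in this notebook does) would keep them; whether that matters depends on how the CSV is actually encoded.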


### Main Data Processing Function


In [16]:
def process_data(data_path, age_column, text_column, gender_prefix):
    """
    Process gender-specific data including text preprocessing, sentiment analysis, and age normalization.
    
    Parameters:
    - data_path (str): Path to the CSV file containing the data.
    - age_column (str): Name of the column containing age information.
    - text_column (str): Name of the column containing text data to be processed.
    - gender_prefix (str): Prefix to distinguish between male and female data.
    
    Returns:
    - Tuple of processed results: most_common_filtered_words, sentiment_word_freq_by_valid_age
    """
    print(f"🔄 Processing {gender_prefix} data...")
    
    # Data loading
    data_df = pd.read_csv(data_path)
    data_df[f'{gender_prefix}_processed_text'] = data_df[text_column].apply(simple_preprocess)
    
    # Calculate word frequency and identify most common words
    all_words = [word for text in data_df[f'{gender_prefix}_processed_text'] for word in text]
    english_stopwords = set(stopwords.words('english'))
    filtered_words = [word for word in all_words if word not in english_stopwords]
    filtered_word_freq = Counter(filtered_words)
    most_common_filtered_words = filtered_word_freq.most_common(20)
    
    # Instantiate VADER SentimentIntensityAnalyzer
    sia = SentimentIntensityAnalyzer()
    vader_lexicon = sia.lexicon
    sentiment_words_from_filtered = [word for word in filtered_words if word in vader_lexicon]
    unique_sentiment_words = list(set(sentiment_words_from_filtered))
    
    # Apply age normalization
    data_df[f'normalized_age_{gender_prefix}'] = data_df[age_column].apply(normalize_age)
    valid_age_data_df = data_df.dropna(subset=[f'normalized_age_{gender_prefix}'])
    grouped_texts = valid_age_data_df.groupby(f'normalized_age_{gender_prefix}')[f'{gender_prefix}_processed_text']
    age_grouped_text_valid = grouped_texts.apply(lambda texts: ' '.join(' '.join(text) for text in texts))
    
    # Word frequency analysis with valid age groups
    word_freq_by_valid_age = {}
    for age, text in age_grouped_text_valid.items():
        words = text.split()
        filtered_words = [word for word in words if word not in english_stopwords]
        word_freq = Counter(filtered_words)
        word_freq_by_valid_age[age] = word_freq
    
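    # Analyzing sentiment word frequency by age
    # Index the rows by age so later steps can look values up by age rather than by row position
    sentiment_word_freq_by_valid_age = pd.DataFrame({
        word: [word_freq_by_valid_age[age][word] for age in word_freq_by_valid_age] 
        for word in unique_sentiment_words
    }, index=list(word_freq_by_valid_age.keys()))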
    
    print(f"✅ {gender_prefix} data processing completed!")
    return most_common_filtered_words, sentiment_word_freq_by_valid_age

print("✅ Main data processing function defined!")


✅ Main data processing function defined!
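
The core of `process_data` is a per-age word count restricted to words found in a sentiment lexicon. The following is a minimal standard-library sketch of that idea on toy data. `TOY_LEXICON` and `rows` are hypothetical; the set stands in for the VADER lexicon (`sia.lexicon`) so the snippet runs without NLTK.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for the VADER lexicon used in the notebook.
TOY_LEXICON = {"happy", "great", "love"}

# Made-up (age, tokens) pairs, i.e. already preprocessed happy moments.
rows = [
    (25, ["happy", "got", "new", "job"]),
    (25, ["great", "day", "happy"]),
    (40, ["love", "family", "dinner"]),
]

# One Counter per age, accumulating every token seen at that age.
word_freq_by_age = defaultdict(Counter)
for age, tokens in rows:
    word_freq_by_age[age].update(tokens)

# Sentiment words are the observed words that also appear in the lexicon.
sentiment_words = sorted(
    {w for counts in word_freq_by_age.values() for w in counts if w in TOY_LEXICON}
)

# Age-by-word table; Counter returns 0 for words not seen at an age.
sentiment_table = {
    age: {w: counts[w] for w in sentiment_words}
    for age, counts in sorted(word_freq_by_age.items())
}

for age, counts in sentiment_table.items():
    print(age, counts)
```

Keying the table by age, rather than by row position, keeps the values for two groups comparable even when the groups do not cover exactly the same ages.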


### Visualization Functions


In [17]:
def plot_word_frequency(male_freq, female_freq, word, output_dir='../figs/word_frequency_compare'):
    """
    Create and save a bar chart comparing word frequency between male and female data.
    
    Parameters:
    - male_freq (dict): Male word frequency data
    - female_freq (dict): Female word frequency data  
    - word (str): The word to analyze
    - output_dir (str): Directory to save the plot
    """
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    
    # Extract data for the specific word over the union of ages,
    # so both series have the same length as the x-axis
    ages = sorted(set(male_freq) | set(female_freq))
    male_data = [male_freq.get(age, {}).get(word, 0) for age in ages]
    female_data = [female_freq.get(age, {}).get(word, 0) for age in ages]
    
    # Create the plot
    plt.figure(figsize=(12, 6))
    x = np.arange(len(ages))
    width = 0.35
    
    plt.bar(x - width/2, male_data, width, label='Male', alpha=0.8)
    plt.bar(x + width/2, female_data, width, label='Female', alpha=0.8)
    
    plt.xlabel('Age Group')
    plt.ylabel('Word Frequency')
    plt.title(f'Word Frequency Comparison: "{word}"')
    plt.xticks(x, ages)
    plt.legend()
    plt.tight_layout()
    
    # Save the plot
    plt.savefig(f'{output_dir}/{word}_comparison.jpg', dpi=300, bbox_inches='tight')
    plt.close()
    
    print(f"📊 Chart saved: {word}_comparison.jpg")

print("✅ Visualization functions defined!")


✅ Visualization functions defined!


## 4. Sentiment Analysis Execution

Run the sentiment analysis on both male and female datasets to identify patterns and differences.


In [18]:
# Process male data
print("🚹 Processing male data...")
male_common_words, male_sentiment_freq = process_data(
    './output/male_data.csv', 
    'age', 
    'cleaned_hm', 
    'male'
)

# Process female data  
print("🚺 Processing female data...")
female_common_words, female_sentiment_freq = process_data(
    './output/female_data.csv', 
    'age', 
    'cleaned_hm', 
    'female'
)

print("✅ Sentiment analysis completed for both genders!")


🚹 Processing male data...
🔄 Processing male data...
✅ male data processing completed!
🚺 Processing female data...
🔄 Processing female data...
✅ female data processing completed!
✅ Sentiment analysis completed for both genders!


## 5. Gender Comparison Analysis

Compare sentiment word frequencies between male and female participants across different age groups.


In [19]:
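# Convert sentiment word counts to percentages: each word's count in a row is divided by
# that word's total across all rows, so every column with a non-zero total sums to 100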
print("📊 Converting frequencies to percentages...")

# Male data processing
male_sentiment_percentages = male_sentiment_freq.copy()
for col in male_sentiment_percentages.columns:
    total = male_sentiment_percentages[col].sum()
    if total > 0:
        male_sentiment_percentages[col] = (male_sentiment_percentages[col] / total) * 100

# Female data processing  
female_sentiment_percentages = female_sentiment_freq.copy()
for col in female_sentiment_percentages.columns:
    total = female_sentiment_percentages[col].sum()
    if total > 0:
        female_sentiment_percentages[col] = (female_sentiment_percentages[col] / total) * 100

print("✅ Percentage conversion completed!")


📊 Converting frequencies to percentages...
✅ Percentage conversion completed!
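
The column-by-column loop above can also be written as a single vectorized operation. This is a sketch on an invented frequency table (the values and the `unused` column are made up), not a replacement tested against the real data.

```python
import pandas as pd

# Toy age-by-word count table; "unused" has a zero total on purpose.
freq = pd.DataFrame(
    {"happy": [2, 6, 2], "great": [1, 0, 3], "unused": [0, 0, 0]},
    index=[25, 30, 40],
)

# Column totals; zero totals become NaN so they do not trigger division by zero.
col_totals = freq.sum(axis=0)
safe_totals = col_totals.where(col_totals > 0)

# Divide every column by its own total, then restore zeros for the empty columns.
pct = freq.div(safe_totals, axis="columns").fillna(0) * 100

print(pct)
```

Note what this normalization measures: for each word, the share of its occurrences that falls in each row. It does not say how prominent the word is within an age. If that is the quantity of interest, dividing each row by its own total, for example `freq.div(freq.sum(axis=1), axis="index")`, would be the corresponding row-wise variant.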


In [20]:
# Merge male and female data for comparison
print("🔄 Merging male and female sentiment data...")

# Get common sentiment words (sorted so the order, and any later slice of it, is reproducible)
common_sentiment_words = sorted(set(male_sentiment_percentages.columns) & set(female_sentiment_percentages.columns))
print(f"Found {len(common_sentiment_words)} common sentiment words")

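# Create comparison dataframe
# Only iterate over row labels present in both tables, so a label missing
# from one gender cannot raise a KeyError
shared_ages = male_sentiment_percentages.index.intersection(female_sentiment_percentages.index)
comparison_data = []
for word in common_sentiment_words:
    for age in shared_ages:
        comparison_data.append({
            'word': word,
            'age': age,
            'male_percentage': male_sentiment_percentages.loc[age, word],
            'female_percentage': female_sentiment_percentages.loc[age, word]
        })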

comparison_df = pd.DataFrame(comparison_data)
print(f"Created comparison dataframe with {len(comparison_df)} rows")
print("✅ Data merging completed!")


🔄 Merging male and female sentiment data...
Found 1379 common sentiment words
Created comparison dataframe with 79982 rows
✅ Data merging completed!
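
For larger tables, the nested loop can be replaced by reshaping both tables to long format and joining them. The sketch below assumes both tables are indexed by age. The helper `to_long` and the toy values are hypothetical and are not part of the notebook.

```python
import pandas as pd

# Toy percentage tables indexed by age; the two genders share only age 25 and the word "happy".
male_pct = pd.DataFrame({"happy": [40.0, 60.0], "great": [100.0, 0.0]}, index=[25, 30])
female_pct = pd.DataFrame({"happy": [50.0, 50.0], "love": [10.0, 90.0]}, index=[25, 40])

def to_long(df, value_name):
    """Reshape an age-by-word table into (age, word, value) rows."""
    return (
        df.rename_axis("age")
          .reset_index()
          .melt(id_vars="age", var_name="word", value_name=value_name)
    )

# An inner join keeps only (age, word) pairs present for both genders.
comparison = to_long(male_pct, "male_percentage").merge(
    to_long(female_pct, "female_percentage"),
    on=["age", "word"],
    how="inner",
)

print(comparison)
```

The inner join plays the same role as restricting the loop to common words and ages: pairs that exist for only one gender are dropped rather than compared against a missing value.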


## 6. Visualization Generation

Generate comparison charts for each sentiment word to visualize gender differences across age groups.


In [24]:
# Generate visualization charts
print("📊 Generating comparison charts...")

# Create output directory
os.makedirs('./output/word_frequency_compare', exist_ok=True)

# Generate charts for each sentiment word
charts_created = 0
for word in common_sentiment_words[:50]:  # Limit to first 50 words for demo
    try:
        # Prepare data for plotting
        word_data = comparison_df[comparison_df['word'] == word]
        
        if len(word_data) > 0:
            # Create the plot
            plt.figure(figsize=(12, 6))
            ages = word_data['age'].values
            male_pct = word_data['male_percentage'].values
            female_pct = word_data['female_percentage'].values
            
            x = np.arange(len(ages))
            width = 0.35
            
            plt.bar(x - width/2, male_pct, width, label='Male', alpha=0.8, color='skyblue')
            plt.bar(x + width/2, female_pct, width, label='Female', alpha=0.8, color='lightcoral')
            
            plt.xlabel('Age Group')
            plt.ylabel('Percentage (%)')
            plt.title(f'Word Frequency Comparison: "{word}"')
            plt.xticks(x, ages)
            plt.legend()
            plt.tight_layout()
            
            # Save the plot
            plt.savefig(f'./output/word_frequency_compare/{word}_comparison.jpg', 
                       dpi=300, bbox_inches='tight')
            plt.close()
            
            charts_created += 1
            
    except Exception as e:
        print(f"Error creating chart for '{word}': {e}")
        continue

print(f"✅ Generated {charts_created} comparison charts!")
print("📁 Charts saved to: ./output/word_frequency_compare/")


📊 Generating comparison charts...
✅ Generated 50 comparison charts!
📁 Charts saved to: ./output/word_frequency_compare/


## 7. Results & Insights

Analyze the findings from the sentiment analysis and identify key patterns.


In [25]:
# Display summary statistics
print("📈 SENTIMENT ANALYSIS SUMMARY")
print("=" * 50)

print(f"📊 Dataset Overview:")
print(f"   • Total US participants: {len(us_data)}")
print(f"   • Male participants: {len(male_data)}")
print(f"   • Female participants: {len(female_data)}")
print(f"   • Common sentiment words analyzed: {len(common_sentiment_words)}")

print(f"\n📊 Most Common Words (Male):")
for word, count in male_common_words[:10]:
    print(f"   • {word}: {count}")

print(f"\n📊 Most Common Words (Female):")
for word, count in female_common_words[:10]:
    print(f"   • {word}: {count}")

print(f"\n📊 Analysis Results:")
print(f"   • Comparison charts generated: {charts_created}")
print(f"   • Age groups analyzed: {len(male_sentiment_percentages.index)}")
print(f"   • Sentiment words (male): {len(male_sentiment_percentages.columns)}")

print("\n✅ Sentiment analysis workflow completed successfully!")
print("📁 Check the './output/word_frequency_compare/' directory for visualization charts.")


📈 SENTIMENT ANALYSIS SUMMARY
==================================================
📊 Dataset Overview:
   • Total US participants: 79063
   • Male participants: 44259
   • Female participants: 34100
   • Common sentiment words analyzed: 1379

📊 Most Common Words (Male):
   • got: 6137
   • happy: 5709
   • made: 4826
   • work: 4205
   • new: 4091
   • went: 3515
   • time: 3388
   • able: 2662
   • good: 2636
   • day: 2596

📊 Most Common Words (Female):
   • happy: 6381
   • made: 4568
   • got: 4563
   • time: 2899
   • went: 2870
   • new: 2695
   • work: 2670
   • day: 2543
   • husband: 2169
   • able: 2088

📊 Analysis Results:
   • Comparison charts generated: 50
   • Age groups analyzed: 58
   • Sentiment words (male): 1801

✅ Sentiment analysis workflow completed successfully!
📁 Check the './output/word_frequency_compare/' directory for visualization charts.
