<a href="https://colab.research.google.com/github/YamaMaki/aiasuka-data-project/blob/main/The_AiAsuka_Effect.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The Aiasuka Effect: A Case Study on Gender Bias in Web3

## 1. Objective & Hypothesis

**Objective:** This case study investigates how gender identity, specifically a synthetic feminine persona, affects engagement and follower growth on X in the male-dominated Web3 space. By deploying "Aiasuka," an AI-generated persona, and bookending the experiment with authentic identity posts, it examines gender bias, parasocial attachment, and the ethical risks of synthetic personas displacing real female voices.

**Hypothesis:** An AI-generated feminine persona will significantly increase engagement and follower growth through parasocial attraction and gender bias, even when audiences suspect it is artificial. Reverting to an authentic male identity will expose how fragile these bonds are, showing that perceived gender outweighs authentic identity in driving engagement.

## 2. Data Loading & Cleaning

In this section, we load the raw data from a multi-tab Google Sheet. We then perform a series of data cleaning and standardization steps—including cleaning column names, consolidating redundant columns, and handling missing values—to create a single, unified DataFrame that is ready for analysis.

In [None]:
import pandas as pd
google_sheet_url = 'https://docs.google.com/spreadsheets/d/1TW7VnSh1zak52d5IpG7pUgxwjK1VJAvEVaPb918O_dE/export?format=xlsx'

# read all tabs into a dictionary of DataFrames
all_tabs = pd.read_excel(google_sheet_url, sheet_name=None)

# Combine all DataFrames into one master DataFrame
master_df = pd.concat(all_tabs.values(), ignore_index=True)

# Print the combined DataFrame
print("--- Successfully loaded and combined DataFrame ---")
print(master_df.head())
print("\n--- Checking the 'Phase' column values ---")
print(master_df['Phase'].value_counts())
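
Concatenating every tab discards the information about which sheet a row came from. As a minimal sketch, assuming that provenance is worth keeping: passing the dictionary itself to `pd.concat` turns the tab names into an index level that can be moved into a column. The tab names, the toy values, and the `source_tab` column name are hypothetical.

```python
import pandas as pd

# Toy stand-in for the dict returned by pd.read_excel(..., sheet_name=None).
# Tab names and values are hypothetical, for illustration only.
toy_tabs = {
    "Control": pd.DataFrame({"Phase": ["Control", "Control"], "Impressions": [100, 120]}),
    "Persona": pd.DataFrame({"Phase": ["Persona"], "Impressions": [300]}),
}

# Passing the dict directly makes each tab name the outer index level.
combined = pd.concat(toy_tabs, names=["source_tab", "row"])

# Move the tab name into an ordinary column and discard the per-tab row number.
combined = combined.reset_index(level="source_tab").reset_index(drop=True)

print(combined)
```

With a `source_tab` column in place, any disagreement between a tab's name and its `Phase` values becomes easy to spot.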

In [None]:

print("--- 1. Overall DataFrame Info ---")
# This will show us all the column names, and how many non-null values are in each.
master_df.info()

print("\n\n--- 2. Count of Missing Values per Column ---")
# This gives a direct count of how many NaN values are in each column.
# We expect to see a lot here because of the different column structures.
print(master_df.isnull().sum())

print("\n\n--- 3. Let's Look at the Column Names ---")
# This will show us all the column names so we can spot inconsistencies.
print(master_df.columns)

In [None]:
print("--- Original Column Names ---")
print(master_df.columns)

# Apply a series of cleaning steps to every column name
new_columns = master_df.columns
new_columns = new_columns.str.lower()                              # 1. convert to lowercase
new_columns = new_columns.str.replace(' ', '_', regex=False)       # 2. replace spaces with underscores
new_columns = new_columns.str.replace('(y/n)', '', regex=False)    # 3. remove '(y/n)'
new_columns = new_columns.str.replace('(%)', '_pct', regex=False)  # 4. replace '(%)' with '_pct'
new_columns = new_columns.str.replace('.', '_', regex=False)       # 5. replace '.' with '_'
master_df.columns = new_columns

print("\n--- Updated Column Names ---")
print(master_df.columns)

print("\n--- DataFrame Head with Clean Columns ---")
print(master_df.head())
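
The five steps can also be written as a plain function, which makes it easier to see why later cells use names such as `image_attached_` (trailing underscore) and `engagement_rate__pct` (double underscore). This is a minimal sketch; the raw header strings are hypothetical, chosen only to resemble the cleaned names used later in the notebook.

```python
def clean_column_name(name: str) -> str:
    """Apply the same five cleaning steps to a single column name."""
    cleaned = name.lower()                    # 1. lowercase
    cleaned = cleaned.replace(" ", "_")       # 2. spaces -> underscores
    cleaned = cleaned.replace("(y/n)", "")    # 3. drop the (y/n) marker
    cleaned = cleaned.replace("(%)", "_pct")  # 4. (%) -> _pct
    cleaned = cleaned.replace(".", "_")       # 5. dots -> underscores
    return cleaned


# Hypothetical raw headers (assumed, not taken from the actual sheet).
examples = ["Image Attached (Y/N)", "Engagement Rate (%)", "Post Type.1"]
for raw in examples:
    print(f"{raw!r} -> {clean_column_name(raw)!r}")
```

Because the space before `(%)` becomes an underscore before `(%)` itself becomes `_pct`, the result contains two underscores in a row. A function like this could be applied with `master_df.rename(columns=clean_column_name)`.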

In [None]:
# Let's see what's inside the two post_type columns

print("--- Investigating 'post_type' column ---")
print(master_df['post_type'].value_counts(dropna=False))

print("\n\n--- Investigating 'post_type_1' column ---")
print(master_df['post_type_1'].value_counts(dropna=False))

In [None]:
print("--- Investigating ALL Post Category Columns ---")

print("\n--- Column: 'post_type' (Cleaned) ---")
# We already cleaned this one, so the output should look good
print(master_df['post_type'].value_counts(dropna=False))

print("\n--- Column: 'post_format' ---")
# This is the one we just discovered from the other sheet
print(master_df['post_format'].value_counts(dropna=False))

print("\n--- Column: 'post_type_1' ---")
# This was the other one we found
print(master_df['post_type_1'].value_counts(dropna=False))

In [None]:
# --- Complete Data Unification Script ---

import numpy as np
import pandas as pd

# Step 1: Coalesce the two descriptive columns ('post_format' and 'post_type_1')
# This creates a new column, filling NaN values in the first with values from the second.
master_df['temp_category'] = master_df['post_format'].combine_first(master_df['post_type_1'])

print("--- Step 1 Complete: Combined the two descriptive columns. ---")


# Step 2: Standardize the new combined category column
# We write a custom function to clean up all the messy, inconsistent text.

def standardize_post_type(category):
    # First, handle any potential missing values that might still exist
    if pd.isnull(category):
        return 'Unknown'

    # Convert text to lowercase to handle case inconsistencies (e.g., 'Text post' vs 'text post')
    cat_lower = str(category).lower()

    # Now, check for keywords to group similar post types together
    if 'gm' in cat_lower:
        return 'GM Post'
    elif 'gn' in cat_lower:
        return 'GN Post'
    elif 'quote' in cat_lower:
        return 'Quote Post'
    elif 'text' in cat_lower:
        return 'Text Post'
    elif 'space' in cat_lower:
        return 'X Space'
    elif 'ga' in cat_lower:
        return 'GA Post'
    # This is a catch-all for any other specific types we didn't define a rule for
    else:
        return 'Misc Post'

# Use the .apply() method to run our custom function on every row of the 'temp_category' column
master_df['post_category'] = master_df['temp_category'].apply(standardize_post_type)

print("--- Step 2 Complete: Created a single, clean 'post_category' column. ---")


# Step 3: Final Cleanup of the DataFrame
# Now that we have our master 'post_category', we can drop the old, messy columns to tidy up.
columns_to_drop = ['post_type', 'post_format', 'post_type_1', 'temp_category']
master_df = master_df.drop(columns=columns_to_drop)

print("--- Step 3 Complete: Dropped the old, messy columns. ---")


# Step 4: Verification
# Let's look at the final results to confirm our work.
print("\n--- FINAL VERIFICATION ---")
print("\nValue Counts for our clean 'post_category' column:")
print(master_df['post_category'].value_counts())

print("\nHead of our final, cleaned DataFrame:")
print(master_df.head())
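
One caveat with the keyword rules above: a plain substring check such as `'ga' in cat_lower` also matches inside longer words (`'ga'` occurs in "engagement", `'gn'` in "signal"). Whether the raw labels contain such words is not known here, so the following is only a sketch of a stricter variant that matches whole words with regular-expression word boundaries. The function name `standardize_post_type_strict` and the sample labels are hypothetical.

```python
import re

# Ordered (pattern, label) rules; \b restricts each keyword to a whole word.
RULES = [
    (r"\bgm\b", "GM Post"),
    (r"\bgn\b", "GN Post"),
    (r"\bquote\b", "Quote Post"),
    (r"\btext\b", "Text Post"),
    (r"\bspaces?\b", "X Space"),
    (r"\bga\b", "GA Post"),
]


def standardize_post_type_strict(category):
    """Word-boundary variant of standardize_post_type (illustrative only)."""
    # Treat None and float NaN (the only value where x != x) as missing.
    if category is None or category != category:
        return "Unknown"
    text = str(category).lower()
    for pattern, label in RULES:
        if re.search(pattern, text):
            return label
    return "Misc Post"


# Hypothetical labels showing where the two approaches would disagree.
for label in ["GM selfie", "Engagement bait", "signal boost", "X Spaces recap"]:
    print(f"{label!r} -> {standardize_post_type_strict(label)}")
```

The stricter rules trade recall for precision: a label like "quoted reply" would no longer count as a Quote Post, so the patterns would need to be checked against the actual label values before swapping this in.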



In [None]:
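# Drop the free-text 'notes' column; errors='ignore' keeps the cell safe to re-run
master_df = master_df.drop(columns='notes', errors='ignore')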
print(master_df.head())

In [None]:
# Identify the categorical columns that still have missing values
# Based on our diagnostic, these are the main ones (excluding 'post_text' for now)
cols_to_fill = [
    'image_attached_',
    'control',
    'image_theme',
    'ai_asuka_involved',
    'in-group_activity_present',
    'assumed_poster',
    'comment_summary'
]

# Loop through the list of columns and fill any missing values with 'Unknown'
for col in cols_to_fill:
    if col in master_df.columns: # A safe check to make sure the column exists
        master_df[col] = master_df[col].fillna('Unknown')

# --- Final Verification ---
# Let's run our missing value check one last time to confirm our work.
print("--- Final Missing Value Check ---")
print(master_df.isnull().sum())

## 3. Exploratory Analysis & Visualizations

### 3.1 Average Impressions per Phase

To get a high-level overview of the experiment's impact, we first analyze the average number of impressions per post for each of the three phases. The bar chart below shows a pronounced spike in average impressions during the 'Persona' phase, with a sustained "halo effect" carrying into the 'Post' phase. Whether the difference is statistically significant is tested in Section 5.

In [None]:
# preparing data for our specific plot
# we will group our clean master_df by phase and calculate the mean of the impressions
phase_impressions = master_df.groupby('phase')['impressions'].mean().reset_index()
print(phase_impressions)

In [None]:
# Import seaborn and its supporting library (matplotlib) for visualization
import seaborn as sns
import matplotlib.pyplot as plt

# we are going to make sure the names in our summary table are correct
# (they might have been cleaned again during the load process)
print(phase_impressions['phase'].unique())

#create a list that defines the correct chronological order
phase_order = ['Control', 'Persona', 'Post']

#create the bar plot with the correct order of phases
sns.catplot(x='phase', y='impressions', data=phase_impressions, kind='bar', order=phase_order)

#adding a clear, story-driven title
plt.suptitle('Average Impressions Spiked During Persona Phase', y=1.03)

#adding the axis labels for a clearer story
plt.xlabel('Experiment Phase')
plt.ylabel('Average Impressions per Post')

plt.show()

### 3.2 Relationship Between Impressions and Engagements by Phase

To understand the dynamics of engagement, we plot impressions against total engagements for each post. The faceted scatter plot below shows this relationship for each of the three experiment phases. The correlation is generally positive, but the density and spread of the data points differ noticeably across the phases, particularly in the 'Persona' phase, which saw higher-impression posts.

In [None]:
# we are making a relplot that shows the relationship between impressions and
# engagements per phase

#create a list that defines the correct chronological order
phase_order = ['Control', 'Persona', 'Post']

# setting up the palette for the scatter plot for easier visualization
sns.set_palette('plasma')


# create the rel plot with the correct order of phases, impressions and
# engagements
sns.relplot(x='impressions', y='engagements', data=master_df, kind='scatter', col_order=phase_order, col='phase',alpha=0.5)

#adding a clear, story-driven title
plt.suptitle('Engagements Followed Impressions Across All 3 Phases', y=1.03)

plt.show()
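
The "generally positive correlation" seen in the scatter plot can also be expressed as a number per phase. The sketch below computes a Pearson correlation within each phase on a small, made-up table. The column names mirror the ones used above, but every value is hypothetical.

```python
import pandas as pd

# Hypothetical toy rows standing in for master_df (values are made up).
toy = pd.DataFrame({
    "phase": ["Control"] * 3 + ["Persona"] * 3,
    "impressions": [100, 200, 300, 400, 800, 1200],
    "engagements": [30, 55, 95, 90, 210, 300],
})

# Pearson correlation between impressions and engagements within each phase.
per_phase_corr = {
    phase: group["impressions"].corr(group["engagements"])
    for phase, group in toy.groupby("phase")
}

for phase, r in per_phase_corr.items():
    print(f"{phase}: r = {r:.2f}")
```

With only a handful of posts per phase, a correlation coefficient is unstable, so it is best read alongside the scatter plot rather than instead of it.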

### 3.3 Deeper Dive into Engagement Rates

To understand the nuances of the engagement rate, a heatmap allows us to compare the average rates across every post category and experiment phase simultaneously. The heatmap below reveals a key insight: while the 'Persona' phase drove higher impressions, the 'Control' phase actually had the highest engagement rates for its core post types (GM/GN), indicating a more dedicated, albeit smaller, audience.

In [None]:
# (Assuming master_df is our clean DataFrame from the previous steps)

print("--- Preparing Data for the Heatmap ---")

# First, let's make sure our phases are in the correct chronological order
# We can do this by converting the 'phase' column to a categorical data type
phase_order = ['Control', 'Persona', 'Post']
master_df['phase'] = pd.Categorical(master_df['phase'], categories=phase_order, ordered=True)

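# Now, pivot the data to create the grid format needed for a heatmap.
# Each cell holds the mean engagement rate for one category/phase pair.
heatmap_pivot = master_df.pivot_table(
    values='engagement_rate__pct',  # The values to fill the grid
    index='post_category',         # The rows of the grid
    columns='phase',               # The columns of the grid
    aggfunc='mean',                # Average the rate within each cell
    observed=False                 # Keep every phase column, even if empty
)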

# 1. Define a more logical order for our post categories (the rows)
category_order = ['GM Post', 'GN Post', 'GA Post', 'Text Post', 'Quote Post', 'X Space', 'Misc Post']

# 2. Re-order the pivot table's index to match our logical order
heatmap_pivot = heatmap_pivot.reindex(category_order)

# 3. Fill any missing values with 0 for a cleaner visual
#    (a 0.0 cell means "no posts of this type in this phase", not zero engagement)
heatmap_pivot_clean = heatmap_pivot.fillna(0)


# --- The New, Improved Heatmap ---

# Make the figure a bit bigger to give it space
plt.figure(figsize=(10, 7))

# Plot the cleaned and re-ordered pivot table
sns.heatmap(data=heatmap_pivot_clean,
            annot=True,
            fmt=".1f",
            cmap='viridis')

# Add a clearer title and labels
plt.title('Average Engagement Rate (%) by Phase and Post Type', fontsize=16)
plt.xlabel('Experiment Phase')
plt.ylabel('Post Category')

plt.tight_layout()
plt.show()

## 4. Follower Growth Analysis

One of the primary outcomes of the experiment was the rapid follower growth observed during the 'Persona' phase. This section analyzes the follower trend over the entire observation period. The data, based on snapshots from X Analytics and later corrected, shows that the vast majority of the account's growth occurred during the 7-day 'Persona' phase.

In [None]:
# The URL to the raw CSV file on GitHub
github_url = 'https://raw.githubusercontent.com/YamaMaki/aiasuka-data-project/refs/heads/main/follower_growth%20-%20Sheet1.csv'

# Let pandas read the CSV directly from the URL
followers_df = pd.read_csv(github_url)

print("Successfully loaded data from GitHub!")
print(followers_df.head())

# Convert 'Date' column to datetime objects for proper plotting
followers_df['Date'] = pd.to_datetime(followers_df['Date'])

# Convert 'Followers' to a numeric type; casting to str first keeps this safe
# whether pandas parsed the column as text ("1,234") or as numbers
followers_df['Followers'] = (
    followers_df['Followers'].astype(str).str.replace(',', '', regex=False).astype(float)
)

# --- The Investigation ---
# Check the column names and dtypes to make sure the data is clean
print("Columns found in your CSV file:")
print(followers_df.columns)
print(followers_df.dtypes)


sns.set_style('darkgrid')
sns.set_palette('viridis')

sns.relplot(x='Date', y='Followers', data=followers_df, kind='line', marker='o')
plt.title('Follower Growth Over AiAsuka Experiment')
plt.xlabel('Date')
plt.ylabel('Number of Followers')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
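
To put numbers on the growth shown in the line chart, the change between consecutive snapshots can be computed directly. This is a minimal sketch on made-up data in the same shape as the CSV (`Date`, `Followers`). The dates and counts are hypothetical, and `net_change` and `gain_since_start` are illustrative column names.

```python
import pandas as pd

# Hypothetical snapshots in the same shape as the CSV (dates and counts are made up).
toy_followers = pd.DataFrame({
    "Date": pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-03", "2025-01-04"]),
    "Followers": [1000.0, 1010.0, 1150.0, 1160.0],
})

toy_followers = toy_followers.sort_values("Date")

# Change since the previous snapshot; the first row has nothing to compare against.
toy_followers["net_change"] = toy_followers["Followers"].diff()

# Cumulative gain relative to the first snapshot.
toy_followers["gain_since_start"] = (
    toy_followers["Followers"] - toy_followers["Followers"].iloc[0]
)

print(toy_followers)
```

Summing `net_change` over the rows that fall inside a phase's date range would give that phase's follower gain, provided the phase boundaries are known.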


## 5. Statistical Validation

To check whether the observed differences could be explained by random chance, we ran a series of statistical tests. We used independent two-sample t-tests to compare the mean engagement rates between each pair of phases and a one-way ANOVA to compare the mean impressions across all three phases. The significance level (alpha) was set at 0.05.

In [None]:
from scipy.stats import ttest_ind

print("--- Running Full Statistical Validation on Engagement Rates ---")

# --- 1. Prepare Data Samples for All Phases ---
# Drop missing rates first; a single NaN would make ttest_ind return NaN
control_rates = master_df[master_df['phase'] == 'Control']['engagement_rate__pct'].dropna()
persona_rates = master_df[master_df['phase'] == 'Persona']['engagement_rate__pct'].dropna()
post_rates = master_df[master_df['phase'] == 'Post']['engagement_rate__pct'].dropna()

# --- 2. Run All Three T-Tests ---
stat_cvp, p_value_cvp = ttest_ind(control_rates, persona_rates)
stat_pvp, p_value_pvp = ttest_ind(persona_rates, post_rates)
stat_cvsp, p_value_cvsp = ttest_ind(control_rates, post_rates)

# --- 3. Print a Clean Summary of Results ---
print("\n--- T-Test Results Summary (alpha = 0.05) ---")

# Pair each comparison label with its p-value so one loop can report all three
comparisons = [
    ("Control Phase vs. Persona Phase", p_value_cvp),
    ("Persona Phase vs. Post Phase", p_value_pvp),
    ("Control Phase vs. Post Phase", p_value_cvsp),
]

for number, (label, p_value) in enumerate(comparisons, start=1):
    print(f"\n{number}. {label}:")
    print(f"   P-value: {p_value:.4f}")
    if p_value < 0.05:
        print("   Result: The difference IS statistically significant.")
    else:
        print("   Result: The difference IS NOT statistically significant.")
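
Running three pairwise tests at alpha = 0.05 raises the chance of at least one false positive. One common safeguard, not part of the original analysis, is the Holm-Bonferroni adjustment. The sketch below implements it in plain Python. The function name `holm_adjust` and the p-values are hypothetical and are not the results of the tests above.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on,
        # with each product capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical p-values for the three pairwise comparisons (not real results).
raw = {"Control vs Persona": 0.20, "Persona vs Post": 0.01, "Control vs Post": 0.03}
adjusted = holm_adjust(list(raw.values()))

for (label, p_raw), p_adj in zip(raw.items(), adjusted):
    verdict = "significant" if p_adj < 0.05 else "not significant"
    print(f"{label}: raw p = {p_raw:.3f}, Holm-adjusted p = {p_adj:.3f} ({verdict})")
```

In this made-up example, the 0.03 comparison is significant before adjustment but not after it, which is the kind of borderline result the correction is meant to flag.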


In [None]:
from scipy.stats import f_oneway

# --- Prepare the Data Samples for the ANOVA Test ---

# Create a Series containing all impressions from the 'Control' phase
control_impressions = master_df[master_df['phase'] == 'Control']['impressions']

# Create a Series containing all impressions from the 'Persona' phase
persona_impressions = master_df[master_df['phase'] == 'Persona']['impressions']

# Create a Series containing all impressions from the 'Post' phase
post_impressions = master_df[master_df['phase'] == 'Post']['impressions']


# --- Verify our samples ---
print("--- Data Ready for ANOVA Test ---")
print(f"Number of 'Control' samples: {len(control_impressions)}")
print(f"Average Control Impressions: {control_impressions.mean():.0f}")

print(f"\nNumber of 'Persona' samples: {len(persona_impressions)}")
print(f"Average Persona Impressions: {persona_impressions.mean():.0f}")

print(f"\nNumber of 'Post' samples: {len(post_impressions)}")
print(f"Average Post Impressions: {post_impressions.mean():.0f}")

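# Run the one-way ANOVA (missing values dropped) and report the result
f_stat, p_value_anova = f_oneway(
    control_impressions.dropna(),
    persona_impressions.dropna(),
    post_impressions.dropna(),
)
print(f"\nANOVA F-statistic: {f_stat:.2f}")
print(f"ANOVA P-value: {p_value_anova:.4f}")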
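
An ANOVA p-value says whether the phase means differ, not by how much. A common companion statistic is eta squared, the share of total variance explained by group membership. The sketch below computes it in plain Python on hypothetical impression counts. The function name `eta_squared` and all values are assumptions for illustration.

```python
def eta_squared(*groups):
    """Share of total variance explained by group membership (SS_between / SS_total)."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    # Between-group sum of squares: group size times squared distance of the
    # group mean from the grand mean.
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2 for group in groups
    )
    # Total sum of squares: squared distance of every value from the grand mean.
    ss_total = sum((value - grand_mean) ** 2 for value in all_values)
    return ss_between / ss_total


# Hypothetical impression counts per phase (made up for illustration).
control = [100, 120, 110]
persona = [300, 320, 310]
post = [200, 220, 210]

print(f"eta squared = {eta_squared(control, persona, post):.3f}")
```

Values near 1 mean that phase membership accounts for most of the spread in impressions. Values near 0 mean the phases overlap heavily even if the ANOVA p-value is small.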


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# --- 1. Load the Data ---
# Reuse github_url from Section 4; a bare local filename only resolves if the
# CSV has been uploaded to the runtime
followers_df = pd.read_csv(github_url)

# --- 2. Initial Inspection ---
print("--- Follower Data Loaded Successfully ---")
print("First 5 rows:")
print(followers_df.head())

print("\n--- Data Info ---")
# Let's check the column names and data types (Dtypes)
followers_df.info()

### Summary Statistics Table

To provide a clear, high-level overview of the key performance indicators for each phase, the following summary table was generated from the clean dataset. This table serves as the primary source of truth for the metrics cited throughout this analysis.

In [None]:
# This code block calculates the key summary stats for each phase.
# First, group by the 'phase' column and aggregate our metrics
phase_summary = master_df.groupby('phase', observed=True).agg(
    total_impressions=('impressions', 'sum'),
    total_engagements=('engagements', 'sum'),
    average_engagement_rate=('engagement_rate__pct', 'mean')
)

# Round the average engagement rate (already expressed in percent) to 2
# decimal places
phase_summary['average_engagement_rate'] = phase_summary['average_engagement_rate'].round(2)

print("--- Validation of Key Metrics from Cleaned Data ---")
print(phase_summary)
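
The table reports the mean of per-post engagement rates, which is not the same as dividing total engagements by total impressions. The sketch below shows the difference on made-up numbers. It assumes the engagement rate is defined as engagements / impressions × 100, which the source sheet does not confirm.

```python
# Hypothetical posts as (impressions, engagements); values are made up.
posts = [(100, 40), (1000, 100), (50, 25)]

# Assumed definition: engagement rate (%) = engagements / impressions * 100.
per_post_rates = [engagements / impressions * 100 for impressions, engagements in posts]

# Mean of per-post rates: every post counts equally, whatever its reach.
mean_of_rates = sum(per_post_rates) / len(per_post_rates)

# Pooled rate: total engagements over total impressions, so large posts dominate.
total_impressions = sum(impressions for impressions, _ in posts)
total_engagements = sum(engagements for _, engagements in posts)
pooled_rate = total_engagements / total_impressions * 100

print(f"mean of per-post rates: {mean_of_rates:.2f}%")
print(f"pooled rate:            {pooled_rate:.2f}%")
```

When a phase contains a few very high-impression posts with low rates, the pooled figure can fall well below the mean of rates, so it helps to state which of the two a quoted percentage refers to.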

## 6. Conclusion & Final Thoughts

### Key Findings

This analysis, based on a cleaned and unified dataset, supports the core findings of the Aiasuka experiment:

* **Finding 1:** The introduction of the synthetic "Persona" coincided with a large, statistically significant increase in **impressions** and with rapid **follower growth**, indicating that the strategy is effective for capturing audience attention in the short term.
* **Finding 2:** This growth in reach was not matched by engagement *quality*. The average **engagement rate** was highest in the authentic "Control" phase, although the difference between the Control and Persona phases was not statistically significant.
* **Finding 3:** The end of the experiment in the "Post" phase was followed by a severe, statistically significant **drop in engagement rates**, consistent with the fragile and unsustainable nature of the parasocial bonds formed with the synthetic persona.

### Ethical Implications & The "Synthetic Femme" Problem

#### 7. Insights

##### 7.1 Gender Bias Amplifies Virality

Aiasuka’s +47% Impression spike (15,069 vs. 10,255) and ~700+ follower gain, driven by selfies and $SOL giveaways (e.g., April 22 GM, 33.91%), reflect gender bias. Persona’s 26.70% Engagement Rate is lower than Control’s 32.21% (e.g., 69.52% GM), but GM (39.13%) and GN (37.75%) outperform Control’s Misc (28.07%), showing synthetic femininity’s strength.

##### 7.2 Parasocial Drift Spectrum

Persona-phase replies (e.g., “goodmorning babe”) showed emotional investment, with flirty, supportive, and begging comments (e.g., April 22 GM, 79 Replies). Control’s neutral replies (e.g., April 15 GM, 31 Replies) contrast with Post-phase’s superficial affirmations (e.g., “Facts 💯”). The April 23 bookend (9.18%, 72 Replies) reflects shock (e.g., “SO UR NOT A GIRL”), with ~60% of Persona followers disengaging emotionally.

##### 7.3 Engagement Trade-Off

Control’s 32.21% Engagement Rate (e.g., 44.20% Misc “story in 3 images”) reflects authentic Web3 appeal. Persona’s 26.70% (e.g., 40.41% GN) and Post’s 16.91% (e.g., 34.30% GM) show a trade-off: Aiasuka’s GM (39.13%) outperformed Control’s Misc (28.07%), but Post-phase GM (24.12%) and “Facts 💯” replies lack emotional depth.

##### 7.4 Halo Effect

Post-phase Impressions (384.3 avg, 3247 max for bookend) and ~250 follower gain confirm Aiasuka’s lasting visibility. The decline from 26.70% to 16.91% Engagement Rate and from Persona GM (39.13%) to Post GM (24.12%) shows fading emotional bonds, unlike Control’s consistent 32.21%.

##### 7.5 Authenticity vs. Perception

Bookends (17.39%, 9.18%) underperformed Control (32.21%) and Persona (26.70%). Control’s Misc (28.07%) outperforms Persona’s Misc (20.33%) and Post’s Misc (14.27%), but slower Post-phase follower growth (~250 vs. ~700) confirms synthetic femininity’s dominance.

#### 8. Post-Reveal Dynamics

The Post phase shows a trajectory of disengagement:

* **Immediate Shock (April 23):** The bookend (9.18%, 72 Replies) reflects surprise (“SO UR NOT A GIRL” @MxmetaX) and positivity (“Handsome”). Lower Engagement Rate vs. Persona’s 26.70% confirms Aiasuka’s pull.
* **Early Post-Phase (April 24–25):** GM/GN posts (e.g., April 25 GM, 34.30%, 48 Replies; April 24 GN, 21.94%, 20 Replies) show strong engagement, driven by $SOL giveaways (e.g., April 24 GM, 24.80%, 66 Replies) and motivational content (e.g., April 25 Misc, 27.63%).
* **Mid Post-Phase (April 26–27):** GM/GN posts remain strong (e.g., April 27 GM, 26.95%, 29 Replies; April 27 GN, 21.64%, 21 Replies), but Misc posts vary (e.g., April 27 Misc, 22.73%, 9 Replies). Spiritual (e.g., April 27 Misc “Psalm 27:1”) and community posts (@Teatimemeta) resonate, but lack Persona’s intensity (e.g., April 22 GN, 40.41%).
* **Giveaway Boost:** Solana/Doge giveaways sustained engagement, but superficial replies signal limited connection.
* **Trust Erosion:** Skeptical and detached replies, with slower follower growth (~250 vs. ~700), highlight trust erosion risks. Community ties (@Teatimemeta, #Dogwarts) maintain moderate engagement.

#### 9. Ethical Warning: The Synthetic Femme Problem

Aiasuka’s 15,069 Impressions, 26.70% Engagement Rate, and ~700+ follower gain dwarf female creator benchmarks (5,000–8,000 Impressions, 10–15% Engagement Rates). Control’s 32.21% shows authentic appeal, but Post’s 16.91% and “Facts 💯” replies highlight risks:

* **Attention Displacement:** Synthetic personas dominate Web3’s attention economy.
* **Trust Erosion:** Detached replies (e.g., “Preach that wisdom king”) and shock (e.g., “catfish!!! LOL”) suggest synthetic personas exploit emotional investment.
* **Stereotype Reinforcement:** Aiasuka’s idealized selfies perpetuate unattainable femininity.
* **Algorithmic Bias:** Persona’s 5.26% Replies and ~60% follower growth amplified synthetic content, while Post’s 4.33% reflects reduced favor.

#### 10. Mitigation Tactics

* Amplify real female creators via X Spaces or NFT collaborations, leveraging tools like Metaplex.
* Require AI personas to disclose artificiality.
* Advocate for X algorithmic audits.
* Educate Web3 on parasocial risks.

#### 11. Application for Brands

* **Short-Term Visibility:** Synthetic personas boost Impressions (15,069) and followers (~700 in 7 days).
* **Hybrid Approach:** Pair AI with real creators, as seen in April 27 Misc’s influencer mentions.
* **Metrics-Driven:** Prioritize Replies (e.g., 79 for April 22 GM, 66 for April 24 GM).
* **Long-Term Trust:** Post’s 16.91% Engagement Rate and “algorithmic fill” replies show synthetic personas’ unsustainability, unlike Control’s 32.21%.

### Final Thought

The Aiasuka experiment highlights a critical tension in our increasingly digital world, demonstrating that while audiences may be drawn to idealized synthetic personas, the resulting connections are fragile and raise profound ethical questions about authenticity and trust online.